Building intelligent systems that can learn about and reason with causes and
effects is a fundamental task in artificial intelligence. This dissertation
addresses various issues in causal reasoning and learning in the framework of
causal Bayesian networks. We offer a complete characterization of the set of
distributions that could be induced by local interventions on variables governed
by a causal Bayesian network. The characterization provides a symbolic
inferential tool for tasks in causal reasoning. We propose a new method of
discovering causal structures, based on the detection of local, spontaneous
changes in the underlying data-generating model. We show that the use of
information about local changes increases our power of causal discovery beyond
the limits set by the independence equivalence that governs Bayesian networks.
In the presence of unmeasured variables, causal models may impose
non-independence functional constraints, and no general criterion was previously
available for finding those constraints. We offer a systematic method of
identifying functional constraints, which facilitates the task of testing causal
models. Causal effects permit us to predict how systems would respond to actions
or policy decisions. We establish new graphical criteria for ensuring the
identification of causal effects that generalize and simplify existing criteria
in the literature, and we provide computational procedures for systematically
identifying causal effects. Assessing the probability of causation, that is, the
likelihood that one event was the cause of another, guides much of what we
understand about and how we act in the world. We show how useful information on
the probabilities of causation can be extracted from empirical data, and how
data from both experimental and nonexperimental studies can be combined to yield
information that neither study alone can provide. Our results clarify the basic
assumptions that must be made before statistical measures such as the
excess-risk-ratio could be used for assessing attributional quantities such as
the probability of causation.


CHAPTER  1 


Causal  Models 


1.1  Introduction 

A major challenge in artificial intelligence is to build autonomous intelligent
systems that can make sense of their environment, so that they can respond to
unexpected events or changes in the environment. Traditional probabilistic and
statistical approaches assume a static, time-invariant environment, and they
cannot predict what happens if the environment changes or some external actions
occur. Such predictions are not discernible from probabilistic information; they
rest on causal relationships. We human beings communicate about the world in
the language of causation, and we would like to build intelligent systems that
understand causal talk. We must build intelligent systems that can learn about
and reason with causes and effects. The two challenges that we will face are:

1. How should an intelligent agent acquire causal information from the
environment?

2.  How  should  an  intelligent  agent  process  available  causal  information? 

This dissertation addresses both problems in the framework of causal
Bayesian networks, also called causal models,1 which provide a mathematical
language for representing and reasoning about causal relations.

1.2  Causal  Models  and  Interventions 

The use of causal models for encoding distributional and causal assumptions
is now fairly standard (see, for example, [Pea88, SGS93, HS95, Jor98, GPR99,
Lau00, Pea00, Daw02]). The simplest such model, called Markovian, consists of a
directed acyclic graph (DAG) G, called a causal graph, over a set
$V = \{V_1, \ldots, V_n\}$ of vertices, representing variables of interest, and
a set of directed edges, or arrows, that connect these vertices (see Figure 1.1
for an example causal graph).

1Throughout this dissertation, we use the terms causal model and causal Bayesian
network interchangeably.
[Figure 1.1: A typical causal graph. Node labels: Unknown factor, Smoking,
Tar in lungs, Cancer.]


The interpretation of a causal graph has two components, probabilistic and
causal. The probabilistic interpretation views the arrows as representing
probabilistic dependencies among the corresponding variables, and the missing
arrows as representing conditional independence assertions: Each variable is
independent of all its non-descendants given its direct parents in the graph.2
These assumptions amount to asserting that the joint probability function
$P(v) = P(v_1, \ldots, v_n)$ factorizes according to the product

$$P(v) = \prod_i P(v_i \mid pa_i) \qquad (1.1)$$

where $Pa_i$ denotes the set of parents of variable $V_i$ in the graph.3

The causal interpretation views the arrows as representing causal influences
between the corresponding variables. In this interpretation, the factorization of
(1.1) still holds, but the factors are further assumed to represent autonomous
data-generation processes, that is, each conditional probability
$P(v_i \mid pa_i)$ represents a stochastic process by which the values of $V_i$
are chosen in response to the values $pa_i$ (previously chosen for $V_i$'s
parents), and the stochastic variation of this assignment is assumed independent
of the variations in all other assignments. Moreover, each assignment process
remains invariant to possible changes in the assignment processes that govern
other variables in the system.

This  modularity  assumption  enables  us  to  predict  the  effects  of  interventions, 
whenever  interventions  are  described  as  specific  modifications  of  some  factors  in 
the product of (1.1). The simplest such intervention, called atomic, involves fixing

2We use family relationships such as “parents,” “children,” “ancestors,” and
“descendants,” to describe the obvious graphical relationships. For example, we
say that $V_i$ is a parent of $V_j$ if there is an arrow from node $V_i$ to
$V_j$, $V_i \rightarrow V_j$.

3We use uppercase letters to represent variables or sets of variables, and use
corresponding lowercase letters to represent their values (instantiations). For
example, $pa_i$ represents an instantiation of $Pa_i$.
a set T of variables to some constants T = t, which yields the post-intervention
distribution4

$$P_t(v) = \begin{cases}
\prod_{\{i \mid V_i \notin T\}} P(v_i \mid pa_i) & \text{for all } v \text{ consistent with } T = t, \\
0 & \text{for all } v \text{ inconsistent with } T = t.
\end{cases} \qquad (1.2)$$


Eq. (1.2) represents a truncated factorization of (1.1), with factors
corresponding to the manipulated variables removed. This truncation follows
immediately from (1.1) since, assuming modularity, the post-intervention
probabilities $P(v_i \mid pa_i)$ corresponding to variables in T are either 1 or
0, while those corresponding to unmanipulated variables remain unaltered.5 If T
stands for a set of treatment variables and Y for an outcome variable in
$V \setminus T$, then Eq. (1.2) permits us to calculate the probability $P_t(y)$
that event Y = y would occur if treatment condition T = t were enforced
uniformly over the population. This quantity, often called the causal effect of
T on Y, is what we normally assess in a controlled experiment with T randomized,
in which the distribution of Y is estimated for each level t of T.
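
As a concrete illustration of Eq. (1.2), the following minimal sketch computes a causal effect by truncated factorization. The three-variable graph Z -> X, Z -> Y, X -> Y, the conditional probability values, and the function names (`factor`, `p_do`, `causal_effect`) are assumptions made purely for illustration; none of them appear in the dissertation.

```python
from itertools import product

# Hypothetical binary Markovian model: Z -> X, Z -> Y, X -> Y.
parents = {"Z": [], "X": ["Z"], "Y": ["X", "Z"]}
# cpt[V][parent values] = P(V = 1 | parents); the numbers are made up.
cpt = {
    "Z": {(): 0.3},
    "X": {(0,): 0.2, (1,): 0.7},
    "Y": {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9},
}
order = ["Z", "X", "Y"]


def factor(var, v):
    """Return P(v_var | pa_var) for a full assignment v (a dict)."""
    p1 = cpt[var][tuple(v[p] for p in parents[var])]
    return p1 if v[var] == 1 else 1.0 - p1


def p_do(v, t):
    """Truncated factorization, Eq. (1.2): P_t(v) under the intervention T = t."""
    if any(v[k] != val for k, val in t.items()):
        return 0.0  # v is inconsistent with T = t
    prob = 1.0
    for var in order:
        if var not in t:  # factors of manipulated variables are removed
            prob *= factor(var, v)
    return prob


def causal_effect(y, t):
    """P_t(Y = y): marginalize the post-intervention distribution over V."""
    total = 0.0
    for vals in product([0, 1], repeat=len(order)):
        v = dict(zip(order, vals))
        if v["Y"] == y:
            total += p_do(v, t)
    return total


effect = causal_effect(1, {"X": 1})  # P_{X=1}(Y = 1)
```

With these made-up numbers `effect` evaluates to 0.55, whereas the observational conditional P(Y = 1 | X = 1) is 0.7; the difference comes from the common cause Z.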

We see from Eq. (1.2) that the model needed for predicting the effect of
interventions requires the specification of three elements:

$$M = (V, G, P(v_i \mid pa_i))$$

where (i) $V = \{V_1, \ldots, V_n\}$ is a set of variables, (ii) G is a directed
acyclic graph with nodes corresponding to the elements of V, and (iii)
$P(v_i \mid pa_i)$, $i = 1, \ldots, n$, is the conditional probability of
variable $V_i$ given its parents in G. Since $P(v_i \mid pa_i)$ is estimable
from nonexperimental data whenever V is observed, we see that, given the causal
graph G, all causal effects are estimable from the data as well.

4[Pea95a, Pea00] used the notation $P(v \mid set(t))$, $P(v \mid do(t))$, or
$P(v \mid \hat{t})$ for the post-intervention distribution, while [Lau00] used
$P(v \,\|\, t)$.

5Eq. (1.2) was named the “Manipulation Theorem” in [SGS93], and is also implicit
in Robins’ (1987) G-computation formula.

1.3 Causal Models with Hidden Variables

Our ability to estimate $P_t(v)$ from nonexperimental data is severely curtailed
when some variables in a Markovian causal model are unobserved. We call
unobserved variables hidden or latent variables. If two or more variables in
V are affected by unobserved confounders, the presence of such confounders
would not permit the decomposition in (1.1). Letting $V = \{V_1, \ldots, V_n\}$
and $U = \{U_1, \ldots, U_{n'}\}$ stand for the sets of observed and hidden
variables, respectively, the observed probability distribution, $P(v)$, becomes
a mixture of products:

$$P(v) = \sum_u \prod_{\{i \mid V_i \in V\}} P(v_i \mid pa_{v_i}) \prod_{\{i \mid U_i \in U\}} P(u_i \mid pa_{u_i}) \qquad (1.3)$$


where $Pa_{v_i}$ and $Pa_{u_i}$ stand for the sets of parents of $V_i$ and $U_i$
respectively, and the summation ranges over all the U variables. The
post-intervention distribution,6 likewise, will be given as a mixture of
truncated products

$$P_t(v) = \begin{cases}
\sum_u \prod_{\{i \mid V_i \notin T\}} P(v_i \mid pa_{v_i}) \prod_i P(u_i \mid pa_{u_i}) & v \text{ consistent with } t, \\
0 & v \text{ inconsistent with } t.
\end{cases} \qquad (1.4)$$


The question of identifiability then arises, i.e., whether it is possible to
express $P_t(y)$ as a function of the observed distribution $P(v)$. Clearly,
given a causal model M and any two sets T and S in V, $P_t(s)$ can be determined
unambiguously using (1.4). The question of identifiability is whether a given
causal effect $P_t(s)$ can be determined uniquely from the distribution $P(v)$
of the observed variables, and is thus independent of the unknown quantities,
$P(v_i \mid pa_{v_i})$ and $P(u_i \mid pa_{u_i})$, that involve elements of U.


Definition 1 (Causal-Effect Identifiability)

The causal effect of a set of variables T on a disjoint set of variables S is
said to be identifiable from a graph G if the quantity $P_t(s)$ can be computed
uniquely from any positive probability of the observed variables, that is, if
$P_t^{M_1}(s) = P_t^{M_2}(s)$ for every pair of models $M_1$ and $M_2$ with
$P^{M_1}(v) = P^{M_2}(v) > 0$ and $G(M_1) = G(M_2) = G$.


In  other  words,  given  the  causal  graph  G,  the  quantity  Pt(s)  can  be  determined 
from  the  observed  distribution  P(v)  alone;  the  details  of  M  are  irrelevant. 
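
To make Definition 1 tangible, the sketch below builds two models over the same graph (X -> Y with one hidden root U pointing into both X and Y) that agree on the positive observed distribution P(x, y) yet disagree on $P_x(y)$, so the effect is not identifiable in that graph. The graph, all parameter values, and the helper name `joint_and_effect` are illustrative assumptions, not taken from the dissertation.

```python
from itertools import product


def joint_and_effect(p_u, p_x_given_u, p_y_given_xu):
    """Return tables (P(x, y), P_x(y)) for U -> X, U -> Y, X -> Y with U hidden.

    p_u is P(U = 1); the other arguments map parent values to P(. = 1 | parents).
    """
    joint, effect = {}, {}
    for x, y in product([0, 1], repeat=2):
        pj = pe = 0.0
        for u in (0, 1):
            pu = p_u if u == 1 else 1 - p_u
            px = p_x_given_u[u] if x == 1 else 1 - p_x_given_u[u]
            py = p_y_given_xu[(x, u)] if y == 1 else 1 - p_y_given_xu[(x, u)]
            pj += pu * px * py  # Eq. (1.3): mixture of products
            pe += pu * py       # Eq. (1.4): the factor of X is removed
        joint[(x, y)] = pj
        effect[(x, y)] = pe     # P_x(y)
    return joint, effect


# Model A: Y listens only to the hidden U (X has no effect on Y).
model_a = dict(
    p_u=0.5,
    p_x_given_u={0: 0.1, 1: 0.9},
    p_y_given_xu={(x, u): (0.8 if u else 0.2) for x in (0, 1) for u in (0, 1)},
)
joint_a, effect_a = joint_and_effect(**model_a)

# Model B: U is inert; X and Y reproduce model A's observed P(x) and P(y | x).
p_x1 = joint_a[(1, 0)] + joint_a[(1, 1)]
p_y1_given_x = {x: joint_a[(x, 1)] / (joint_a[(x, 0)] + joint_a[(x, 1)]) for x in (0, 1)}
model_b = dict(
    p_u=0.5,
    p_x_given_u={0: p_x1, 1: p_x1},
    p_y_given_xu={(x, u): p_y1_given_x[x] for x in (0, 1) for u in (0, 1)},
)
joint_b, effect_b = joint_and_effect(**model_b)
```

Under these assumed parameters the two joint tables coincide, while $P_{x=1}(Y=1)$ is 0.5 in model A and 0.74 in model B.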

If, in a Markovian model with hidden variables, each hidden variable is a root
node with exactly two observed children, then the corresponding model is called
a semi-Markovian model. In a semi-Markovian model, the observed probability
distribution $P(v)$ in Eq. (1.3) can be written as

$$P(v) = \sum_u \prod_i P(v_i \mid pa_i, u^i) \prod_i P(u_i) \qquad (1.5)$$

where $Pa_i$ and $U^i$ stand for the sets of the observed and unobserved parents
of $V_i$ respectively. The post-intervention distribution is then given by

$$P_t(v) = \begin{cases}
\sum_u \prod_{\{i \mid V_i \notin T\}} P(v_i \mid pa_i, u^i) \prod_i P(u_i) & v \text{ consistent with } t, \\
0 & v \text{ inconsistent with } t.
\end{cases} \qquad (1.6)$$


6We  only  consider  interventions  on  observed  variables. 
Figure 1.2: A semi-Markovian model.


It is convenient to represent a semi-Markovian model with a causal graph
G that does not show the elements of U explicitly but, instead, represents the
confounding effects of U variables using bidirected edges. Divergent edges
$V_i \leftarrow U_k \rightarrow V_j$ will be represented by a bidirected edge
between $V_i$ and $V_j$. The presence of such a bidirected edge in G represents
unmeasured factors (or confounders) that may influence two variables in V; we
assume that substantive knowledge permits us to decide if such confounders can
be ruled out from the model. For example, Figure 1.1 will be represented by
Figure 1.2, assuming that the variable U is a hidden variable.
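
The conversion from explicit hidden variables to bidirected edges can be sketched in a few lines. The function name `to_semi_markovian_graph` and the edge list below are hypothetical; in particular, the edges used for the smoking example are an assumed reading of Figure 1.1, whose arrows are not recoverable from the text.

```python
def to_semi_markovian_graph(edges, hidden):
    """Split edges into directed edges among observed nodes and bidirected pairs.

    Assumes each hidden node is a root with exactly two observed children,
    as in the definition of a semi-Markovian model.
    """
    directed = [(a, b) for a, b in edges if a not in hidden and b not in hidden]
    bidirected = []
    for u in sorted(hidden):
        children = sorted(b for a, b in edges if a == u)
        is_root = all(b != u for _, b in edges)
        if not is_root or len(children) != 2 or any(c in hidden for c in children):
            raise ValueError(f"{u} is not a root with exactly two observed children")
        # The divergent edges child_1 <- u -> child_2 become one bidirected edge.
        bidirected.append(tuple(children))
    return directed, bidirected


# Assumed edges for the smoking example (illustrative only).
edges = [("U", "Smoking"), ("U", "Cancer"), ("Smoking", "Tar"), ("Tar", "Cancer")]
directed, bidirected = to_semi_markovian_graph(edges, hidden={"U"})
```

Here `directed` keeps the arrows among observed variables, and `bidirected` holds one pair per hidden confounder.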

Causal Bayesian networks provide a strict mathematical language for reasoning
with causes and effects. This dissertation addresses various issues in causal
reasoning,  including  learning  causal  structures  from  data,  testing  causal  models, 
assessing  the  effects  of  actions,  and  determining  the  causes  of  effects. 


1.4  Contributions 

The principal contributions of this dissertation are:

• The establishment of a necessary and sufficient set of properties for
interventional distributions induced by causal Bayesian networks.

•  A  new  method  of  discovering  causal  structures,  based  on  the  detection  of 
local,  spontaneous  changes  in  the  underlying  data-generating  model. 

•  A  procedure  for  systematically  identifying  functional  constraints  induced 
by  causal  Bayesian  networks  with  hidden  variables. 

•  A  procedure  for  systematically  identifying  causal  effects,  in  the  presence  of 
unmeasured  confounders,  from  a  combination  of  nonexperimental  data  and 
substantive  assumptions  encoded  in  the  form  of  a  directed  acyclic  graph. 

• The derivation of tight bounds on probabilities of causation, from data
obtained in experimental and observational studies, under general assumptions
concerning the data-generating process.




1.5  Overview 


In Chapter 2, we offer a complete characterization of interventional
distributions that could be induced by a causal Bayesian network. We show that the
set  of  interventional  distributions  must  adhere  to  three  norms  of  coherence,  and 
we  demonstrate  the  use  of  these  norms  as  inferential  tools  in  tasks  of  learning 
and  identification.  In  Chapter  3,  we  propose  a  new  method  of  discovering  causal 
structures,  based  on  the  detection  of  local,  spontaneous  changes  in  the  underlying 
data-generating model. We analyze the classes of structures that are equivalent
relative to a stream of distributions produced by local changes, and devise
algorithms that output graphical representations of these equivalence classes. We
investigate  both  the  Bayesian  approach  and  an  approach  that  infers  structures 
by  detecting  distributional  changes.  Chapter  4  develops  a  systematic  procedure 
of  identifying  functional  constraints  induced  by  causal  Bayesian  networks  with 
hidden  variables.  The  procedure  facilitates  the  task  of  testing  causal  models  as 
well as inferring such models from data. Chapter 5 concerns the assessment of
causal effects in nonparametric models. The chapter establishes new criteria for
deciding whether the assumptions encoded in a causal graph are sufficient for
assessing the strength of causal effects and, if the answer is positive, computational
procedures  are  provided  for  expressing  causal  effects  in  terms  of  the  underlying 
joint  distribution.  Chapter  6  shows  how  to  use  the  results  in  Chapter  5  to  identify 
causal  effects  in  linear  models.  Chapter  7  deals  with  the  problem  of  estimating 
the  probability  of  causation,  that  is,  the  probability  that  one  event  was  the  cause 
of  another  in  a  given  scenario,  for  example,  the  probability  that  event  E  would 
not  have  occurred  if  it  were  not  for  event  C,  given  that  C  and  E  did  in  fact  occur. 
Starting  from  structural-semantical  definitions  of  the  probabilities  of  necessary  or 
sufficient  causation  (or  both),  we  show  how  to  bound  these  quantities  from  data 
obtained  in  experimental  and  observational  studies,  under  general  assumptions 
concerning  the  data-generating  process.  The  results  delineate  more  precisely  the 
basic  assumptions  that  must  be  made  before  statistical  measures  such  as  the 
excess-risk-ratio  could  be  used  for  assessing  attributional  quantities  such  as  the 
probability  of  causation. 
CHAPTER 2


A  Characterization  of  Causal  Models 

2.1  Introduction 

In this chapter, we seek a characterization for the set of interventional
distributions, $P_t(v)$, that could be induced by some causal Bayesian network.
Whereas [Pea00, pp. 23-4] has given such a characterization relative to a given
network, we assume that the underlying network, if such exists, is unknown.
Given a collection of arbitrary interventional distributions, we ask whether the
collection is compatible with the predictions of some underlying causal Bayesian
network. Section 2.2 identifies three properties (of the collection) that are
both necessary and sufficient for the existence of such an underlying network.
Section 2.3 identifies necessary properties of distributions induced by
semi-Markovian models, causal Bayesian networks in which some of the variables
are unmeasured. Section 2.4 shows how the properties uncovered in Sections 2.2
and 2.3 can be used as symbolic inferential tools for predicting the effects of
actions from nonexperimental data in the presence of unmeasured variables. The
Conclusion section outlines the use of these properties in learning tasks which
aim at uncovering the structure of the network.

2.2  Interventional  Distributions  in  Markovian  Models 

Let $P_*$ be a set of arbitrary interventional distributions

$$P_* = \{P_t(v) \mid T \subseteq V,\ t \in Dm(T)\} \qquad (2.1)$$

where $Dm(T)$ represents the domain of T. For example, if V consists of two
binary variables X and Y, with the domain of X being $\{x_0, x_1\}$ and the
domain of Y being $\{y_0, y_1\}$, then $P_*$ contains the distributions
$P(x,y)$, $P_{x_0}(x,y)$, $P_{x_1}(x,y)$, $P_{y_0}(x,y)$, $P_{y_1}(x,y)$,
$P_{x_0,y_0}(x,y)$, ..., where each $P_t(x,y)$ is an arbitrary probability
distribution over X, Y. For this set of distributions to be induced by some
underlying causal Bayesian network, such that each $P_t(x,y)$ corresponds to the
distribution of X, Y under the intervention $do(T = t)$ on the causal Bayesian
network, they have to satisfy some norms of coherence. For example, it must
be true that $P_{x_0}(x_0) = 1$. For another example, if the causal graph is
$X \rightarrow Y$ then $P_{y_0}(x_0) = P(x_0)$, and if the causal graph is
$X \leftarrow Y$ then $P_{x_0}(y_0) = P(y_0)$; therefore, it must be true that
either $P_{y_0}(x_0) = P(x_0)$ or $P_{x_0}(y_0) = P(y_0)$. We would like to know
what properties a $P_*$ set must satisfy such that it is compatible with some
underlying causal Bayesian network. In this section, we show that a $P_*$ set
induced from a Markovian causal model is fully characterized by three
properties: effectiveness, Markov, and recursiveness.

Property  1  (Effectiveness)  For  any  set  of  variables  T, 

Pt(t)  =  1.  (2.2) 


Effectiveness  states  that,  if  we  force  a  set  of  variables  T  to  have  the  value  t,  then 
the  probability  of  T  taking  that  value  t  is  one. 

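For any set of variables S disjoint with T, an immediate corollary of
effectiveness reads:

$$P_{t,s}(t) = 1, \qquad (2.3)$$

which follows from

$$P_{t,s}(t) \ge P_{t,s}(t,s) = 1. \qquad (2.4)$$

Equivalently, if $T_1 \subseteq T$, then

$$P_t(t_1) = \begin{cases}
1 & \text{if } t_1 \text{ is consistent with } t, \\
0 & \text{if } t_1 \text{ is inconsistent with } t.
\end{cases} \qquad (2.5)$$

We further have that, for $T_1 \subseteq T$ and S disjoint of T,

$$P_t(s, t_1) = \begin{cases}
P_t(s) & \text{if } t_1 \text{ is consistent with } t, \\
0 & \text{if } t_1 \text{ is inconsistent with } t.
\end{cases} \qquad (2.6)$$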


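Property 2 (Markov) For any two disjoint sets of variables $S_1$ and $S_2$,

$$P_{v \setminus (s_1 \cup s_2)}(s_1, s_2) = P_{v \setminus s_1}(s_1)\, P_{v \setminus s_2}(s_2). \qquad (2.7)$$

An equivalent form of the Markov property is: For any set of variables
$T \subseteq V$,

$$P_t(v \setminus t) = \prod_{\{i \mid V_i \in V \setminus T\}} P_{v \setminus \{v_i\}}(v_i). \qquad (2.8)$$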
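Eq. (2.8) can be obtained by repeatedly applying Eq. (2.7), and Eq. (2.7) follows
from Eq. (2.8) as follows:

$$\begin{aligned}
P_{v \setminus (s_1 \cup s_2)}(s_1, s_2)
&= \prod_{V_i \in S_1 \cup S_2} P_{v \setminus \{v_i\}}(v_i) \\
&= \prod_{V_i \in S_1} P_{v \setminus \{v_i\}}(v_i) \prod_{V_i \in S_2} P_{v \setminus \{v_i\}}(v_i) \\
&= P_{v \setminus s_1}(s_1)\, P_{v \setminus s_2}(s_2).
\end{aligned} \qquad (2.9)$$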
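
The effectiveness and Markov properties can be checked numerically on a small example. The sketch below assumes a hypothetical binary model A -> B, A -> C, B -> C with made-up parameters, generates every $P_t$ by the truncated factorization of Eq. (1.2), and tests Eq. (2.7) for chosen disjoint sets; the helper names `p_do`, `marginal`, and `markov_holds` are illustrative.

```python
from itertools import product

# Hypothetical binary Markovian model; cpt[V][parents] = P(V = 1 | parents).
parents = {"A": [], "B": ["A"], "C": ["A", "B"]}
cpt = {
    "A": {(): 0.6},
    "B": {(0,): 0.3, (1,): 0.8},
    "C": {(0, 0): 0.1, (0, 1): 0.7, (1, 0): 0.5, (1, 1): 0.9},
}
V = list(parents)


def p_do(v, t):
    """P_t(v) by the truncated factorization of Eq. (1.2)."""
    if any(v[k] != val for k, val in t.items()):
        return 0.0
    prob = 1.0
    for var in V:
        if var not in t:
            p1 = cpt[var][tuple(v[p] for p in parents[var])]
            prob *= p1 if v[var] == 1 else 1.0 - p1
    return prob


def marginal(event, t):
    """P_t(event): sum P_t(v) over all v agreeing with the partial assignment."""
    total = 0.0
    for vals in product([0, 1], repeat=len(V)):
        v = dict(zip(V, vals))
        if all(v[k] == val for k, val in event.items()):
            total += p_do(v, t)
    return total


def markov_holds(s1, s2, tol=1e-12):
    """Check Eq. (2.7) for disjoint variable lists s1, s2 and every assignment v."""
    for vals in product([0, 1], repeat=len(V)):
        v = dict(zip(V, vals))
        e1 = {k: v[k] for k in s1}
        e2 = {k: v[k] for k in s2}
        rest = {k: v[k] for k in V if k not in s1 + s2}
        lhs = marginal({**e1, **e2}, rest)
        rhs = (marginal(e1, {k: v[k] for k in V if k not in s1})
               * marginal(e2, {k: v[k] for k in V if k not in s2}))
        if abs(lhs - rhs) > tol:
            return False
    return True
```

For instance, `markov_holds(["B"], ["C"])` compares $P_a(b, c)$ with $P_{a,c}(b)\,P_{a,b}(c)$ for all values, and `marginal({"A": 1}, {"A": 1})` is an instance of effectiveness, Eq. (2.2).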

Definition 2 For two single variables X and Y, define “X affects Y”, denoted
by $X \rightsquigarrow Y$, as $\exists\, W \subset V, w, x, y$, such that
$P_{x,w}(y) \neq P_w(y)$. That is, X affects Y if, under some setting w,
intervening on X changes the distribution of Y.

Property 3 (Recursiveness) For any set of variables
$\{X_0, \ldots, X_k\} \subseteq V$,

$$(X_0 \rightsquigarrow X_1) \wedge \cdots \wedge (X_{k-1} \rightsquigarrow X_k) \Rightarrow \neg (X_k \rightsquigarrow X_0). \qquad (2.10)$$

Property 3 is a stochastic version of the (deterministic) recursiveness axiom
given in [Hal98]. It comes from restricting the causal models under study to
those having acyclic causal graphs. For k = 1, for example, we have
$X \rightsquigarrow Y \Rightarrow \neg (Y \rightsquigarrow X)$, saying that for
any two variables X and Y, either X does not affect Y or Y does not affect X.
[Hal98] pointed out that recursiveness can be viewed as a collection of axioms,
one for each k, and that the case of k = 1 alone is not enough to characterize a
recursive model.
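
Definition 2 and Property 3 translate directly into a brute-force check. The sketch below assumes a hypothetical binary chain X -> Z -> Y with made-up parameters, computes the “affects” relation by enumerating all settings w, and tests that the relation has no directed cycle, which is what Eq. (2.10) demands for every k. The names `affects` and `has_cycle` are illustrative.

```python
from itertools import combinations, product

# Hypothetical binary chain X -> Z -> Y; cpt[V][parents] = P(V = 1 | parents).
parents = {"X": [], "Z": ["X"], "Y": ["Z"]}
cpt = {"X": {(): 0.5}, "Z": {(0,): 0.2, (1,): 0.9}, "Y": {(0,): 0.3, (1,): 0.6}}
V = list(parents)


def p_do(v, t):
    """P_t(v) by the truncated factorization of Eq. (1.2)."""
    if any(v[k] != val for k, val in t.items()):
        return 0.0
    prob = 1.0
    for var in V:
        if var not in t:
            p1 = cpt[var][tuple(v[p] for p in parents[var])]
            prob *= p1 if v[var] == 1 else 1.0 - p1
    return prob


def marginal(event, t):
    """P_t(event) for a partial assignment event."""
    return sum(
        p_do(dict(zip(V, vals)), t)
        for vals in product([0, 1], repeat=len(V))
        if all(dict(zip(V, vals))[k] == val for k, val in event.items())
    )


def affects(a, b, tol=1e-12):
    """Definition 2: is there W, w, x, y with P_{x,w}(y) != P_w(y)?"""
    others = [k for k in V if k not in (a, b)]
    for r in range(len(others) + 1):
        for w_vars in combinations(others, r):
            for w_vals in product([0, 1], repeat=r):
                w = dict(zip(w_vars, w_vals))
                for x, y in product([0, 1], repeat=2):
                    if abs(marginal({b: y}, {**w, a: x}) - marginal({b: y}, w)) > tol:
                        return True
    return False


def has_cycle(rel):
    """True if the directed graph given by rel contains a cycle (violating Eq. (2.10))."""
    nodes = {n for pair in rel for n in pair}
    succ = {n: [b for a, b in rel if a == n] for n in nodes}
    state = {}  # 1 = on the current DFS path, 2 = finished

    def visit(n):
        if state.get(n) == 1:
            return True
        if state.get(n) == 2:
            return False
        state[n] = 1
        if any(visit(m) for m in succ[n]):
            return True
        state[n] = 2
        return False

    return any(visit(n) for n in nodes)


relation = {(a, b) for a in V for b in V if a != b and affects(a, b)}
```

For this assumed chain, `relation` contains the pairs (X, Z), (Z, Y), and (X, Y) and no pair in the reverse direction.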

Theorem  1  (Soundness)  Effectiveness,  Markov,  and  recursiveness  hold  in  all 
Markovian  models. 

Proof: All three properties follow from the factorization of Eq. (1.2).

Effectiveness From Eq. (1.2), we have

$$P_t(T = t') = 0 \quad \text{for } t' \neq t, \qquad (2.11)$$

and since

$$\sum_{t' \in Dm(T)} P_t(t') = 1, \qquad (2.12)$$

we obtain the effectiveness property of Eq. (2.2).
Markov From Eq. (1.2), we have

$$P_t(v \setminus t) = P_t(t, v \setminus t) = \prod_{V_i \in V \setminus T} P(v_i \mid pa_i). \qquad (2.13)$$

Letting $T = V \setminus \{V_i\}$ in Eq. (2.13) yields

$$P_{v \setminus \{v_i\}}(v_i) = P(v_i \mid pa_i). \qquad (2.14)$$

Substituting Eq. (2.14) back into Eq. (2.13), we get the Markov property
(2.8), which is equivalent to (2.7).

Recursiveness Assume that a total order over V that is consistent with the
causal graph is $V_1 < \cdots < V_n$, such that $V_i$ is a nondescendant of
$V_j$ if $V_i < V_j$. Consider a variable $V_j$ and a set of variables
$S \subseteq V$ which does not contain $V_j$. Let
$B_j = \{V_i \mid V_i < V_j, V_i \in V \setminus S\}$ be the set of variables
not in S and ordered before $V_j$, and let
$A_j = \{V_i \mid V_j < V_i, V_i \in V \setminus S\}$ be the set of variables
not in S and ordered after $V_j$. First we show that

$$P_{v_j,s}(b_j) = P_s(b_j). \qquad (2.15)$$

We have

$$\begin{aligned}
P_{v_j,s}(b_j) &= \sum_{a_j} P_{v_j,s}(a_j, b_j) \\
&= \sum_{a_j} P_{v_j,s,a_j}(b_j)\, P_{v_j,s,b_j}(a_j), \quad \text{(by Eq. (2.7))}
\end{aligned} \qquad (2.16)$$

where $P_{v_j,s,a_j}(b_j) = \prod_{\{i \mid V_i \in B_j\}} P(v_i \mid pa_i)$ is
a function of $b_j$ and its parents. Since all variables in $A_j$ are ordered
after the variables in $B_j$, $P_{v_j,s,a_j}(b_j)$ is not a function of $a_j$.
Hence Eq. (2.16) becomes

$$\begin{aligned}
P_{v_j,s}(b_j) &= P_{v_j,s,a_j}(b_j) \sum_{a_j} P_{v_j,s,b_j}(a_j) \\
&= P_{v_j,s,a_j}(b_j).
\end{aligned} \qquad (2.17)$$

Similarly,

$$\begin{aligned}
P_s(b_j) &= \sum_{v_j, a_j} P_s(v_j, a_j, b_j) \\
&= \sum_{v_j, a_j} P_{v_j,s,a_j}(b_j)\, P_{s,b_j}(v_j, a_j) \\
&= P_{v_j,s,a_j}(b_j) \sum_{v_j, a_j} P_{s,b_j}(v_j, a_j) \\
&= P_{v_j,s,a_j}(b_j).
\end{aligned} \qquad (2.18)$$
Eq. (2.15) follows from (2.17) and (2.18).

From Eq. (2.15), we have that, for any two variables $V_i < V_j$ and any set of
variables S,

$$P_{v_j,s}(v_i) = P_s(v_i), \qquad (2.19)$$

which states that if X is ordered before Y then Y does not affect X, based
on our definition of “X affects Y”. Therefore, we have that if X affects Y
then X is ordered before Y, or

$$X \rightsquigarrow Y \Rightarrow X < Y. \qquad (2.20)$$

The recursiveness property (2.10) then follows from (2.20) because the relation
“<” is a total order.


□ 


To  facilitate  the  proof  of  the  completeness  theorem,  we  give  the  following 
lemma. 


Lemma 1 [Pea88, p.124] Given a DAG over V, if a set of functions
$f_i(v_i, pa_i)$ satisfy

$$\sum_{v_i \in Dm(V_i)} f_i(v_i, pa_i) = 1, \quad \text{and} \quad 0 \le f_i(v_i, pa_i) \le 1, \qquad (2.21)$$

and $P(v)$ can be decomposed as

$$P(v) = \prod_i f_i(v_i, pa_i) \qquad (2.22)$$

then we have

$$f_i(v_i, pa_i) = P(v_i \mid pa_i), \quad i = 1, \ldots, n. \qquad (2.23)$$

Theorem 2 (Completeness) If a $P_*$ set satisfies effectiveness, Markov, and
recursiveness, then there exists a Markovian model with a unique causal graph
that can generate this $P_*$ set.

Proof: Define a relation $\prec$ as: $X \prec Y$ if $X \rightsquigarrow Y$. Then
the transitive closure of $\prec$, denoted $\prec^*$, is a partial order over
the set of variables V from the recursiveness property as shown in [Hal98]. Let
“<” be a total order on V consistent with $\prec^*$. We have that

$$\text{if } X < Y \text{ then } P_{y,s}(x) = P_s(x) \qquad (2.24)$$

for any set of variables S. This is because if $P_{y,s}(x) \neq P_s(x)$, then
$Y \rightsquigarrow X$, and therefore $Y \prec X$, which contradicts the fact
that X < Y is consistent with $\prec^*$.

Define a set $PA_i$ as a minimal set of variables that satisfies

$$P_{pa_i}(v_i) = P_{v \setminus \{v_i\}}(v_i). \qquad (2.25)$$

We have that

$$\text{if } V_i < V_j, \text{ then } V_j \notin PA_i. \qquad (2.26)$$

Otherwise, assuming $V_j \in PA_i$ and letting $PA'_i = PA_i \setminus \{V_j\}$,
from Eqs. (2.24) and (2.25) we have

$$P_{pa'_i}(v_i) = P_{pa'_i, v_j}(v_i) = P_{v \setminus \{v_i\}}(v_i), \qquad (2.27)$$

which contradicts the fact that $PA_i$ is minimal. By Eq. (2.26), if we draw an
arrow from each member of $PA_i$ toward $V_i$, the resulting graph G is a DAG.

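Substituting Eq. (2.25) into the Markov property (2.8), we obtain, for any set
of variables T,

$$P_t(v \setminus t) = \prod_{\{i \mid V_i \in V \setminus T\}} P_{pa_i}(v_i). \qquad (2.28)$$

Taking T to be the empty set in Eq. (2.28) decomposes $P(v)$ into the product of
the functions $P_{pa_i}(v_i)$, so by Lemma 1, we get

$$P_{pa_i}(v_i) = P(v_i \mid pa_i). \qquad (2.29)$$

From Eqs. (2.28), (2.29), and the effectiveness property (2.6), Eq. (1.2) follows.
Therefore, a Markovian model with a causal graph G can generate this $P_*$ set.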
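
The construction in this proof suggests a simple brute-force procedure for recovering the parent sets from interventional distributions alone: for each $V_i$, search for a smallest set satisfying Eq. (2.25). The sketch below assumes a hypothetical binary model A -> B, A -> C, B -> C that serves only as an oracle for $P_t$; the search itself queries nothing but interventional marginals. The names `satisfies_2_25` and `minimal_parent_set` are illustrative.

```python
from itertools import combinations, product

# Hypothetical ground-truth model, used only as an oracle for P_t.
parents = {"A": [], "B": ["A"], "C": ["A", "B"]}
cpt = {
    "A": {(): 0.6},
    "B": {(0,): 0.3, (1,): 0.8},
    "C": {(0, 0): 0.1, (0, 1): 0.7, (1, 0): 0.5, (1, 1): 0.9},
}
V = list(parents)


def p_do(v, t):
    """P_t(v) by the truncated factorization of Eq. (1.2)."""
    if any(v[k] != val for k, val in t.items()):
        return 0.0
    prob = 1.0
    for var in V:
        if var not in t:
            p1 = cpt[var][tuple(v[p] for p in parents[var])]
            prob *= p1 if v[var] == 1 else 1.0 - p1
    return prob


def marginal(event, t):
    """P_t(event) for a partial assignment event."""
    return sum(
        p_do(dict(zip(V, vals)), t)
        for vals in product([0, 1], repeat=len(V))
        if all(dict(zip(V, vals))[k] == val for k, val in event.items())
    )


def satisfies_2_25(i, cand, tol=1e-12):
    """Does P_{cand}(v_i) equal P_{v \\ {v_i}}(v_i) for every assignment?"""
    others = [k for k in V if k != i]
    for vals in product([0, 1], repeat=len(others)):
        rest = dict(zip(others, vals))
        sub = {k: rest[k] for k in cand}
        for vi in (0, 1):
            if abs(marginal({i: vi}, sub) - marginal({i: vi}, rest)) > tol:
                return False
    return True


def minimal_parent_set(i):
    """Smallest set satisfying Eq. (2.25); searching by size guarantees minimality."""
    others = [k for k in V if k != i]
    for r in range(len(others) + 1):
        for cand in combinations(others, r):
            if satisfies_2_25(i, cand):
                return set(cand)
    return set(others)


recovered = {i: minimal_parent_set(i) for i in V}
```

With the assumed parameters, `recovered` matches the parent sets of the oracle model. The search is exponential in the number of variables, so it is meant only to mirror the proof, not to serve as a practical learning algorithm.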

Next, we show that the set $PA_i$ is unique. Assuming that there are two
minimal sets $PA_i$ and $PA'_i$ both satisfying Eq. (2.25), we will show that
their intersection also satisfies Eq. (2.25). Let $A = PA_i \cap PA'_i$,
$B = PA_i \setminus A$, $B' = PA'_i \setminus A$, and
$S = V \setminus (PA_i \cup PA'_i \cup \{V_i\})$. From the Markov property
Eq. (2.7), we have

$$\begin{aligned}
P_a(b, b', s, v_i) &= P_{a,v_i}(b, b', s)\, P_{v \setminus \{v_i\}}(v_i) \\
&= P_{a,v_i}(b, b', s)\, P_{a,b}(v_i).
\end{aligned} \qquad (2.30)$$

Summing both sides of (2.30) over B' and S, we get

$$P_a(b, v_i) = P_{a,v_i}(b)\, P_{a,b}(v_i). \qquad (2.31)$$

Substituting $P_{pa_i}(v_i)$ with $P_{pa'_i}(v_i)$ in (2.31), we get

$$P_a(b, v_i) = P_{a,v_i}(b)\, P_{a,b'}(v_i). \qquad (2.32)$$

Summing both sides of (2.32) over B, we obtain

$$P_a(v_i) = P_{a,b'}(v_i) = P_{pa'_i}(v_i), \qquad (2.33)$$

which says that the set $A = PA_i \cap PA'_i$ also satisfies Eq. (2.25). This
contradicts the assumption that both $PA_i$ and $PA'_i$ are minimal. Thus $PA_i$
is unique. □


A  Markovian  model  also  satisfies  the  following  properties. 

Property 4 If a set B is composed of nondescendants of a variable $V_j$, then
for any set of variables S,

$$P_{v_j,s}(b) = P_s(b). \qquad (2.34)$$

Proof: If B is disjoint of S, Eq. (2.34) follows from Eq. (2.15) since
$B \subseteq B_j$. If B is not disjoint of S, Eq. (2.34) follows from the
Effectiveness property and Eq. (2.15). □


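Property 5 For any set of variables $S \subseteq V \setminus (PA_i \cup \{V_i\})$,

$$P_{pa_i,s}(v_i) = P_{pa_i}(v_i). \qquad (2.35)$$

Proof: Let $S' = V \setminus (PA_i \cup \{V_i\} \cup S)$.

$$\begin{aligned}
P_{pa_i,s}(v_i) &= \sum_{s'} P_{pa_i,s}(s', v_i) \\
&= \sum_{s'} P_{v \setminus \{v_i\}}(v_i)\, P_{pa_i,s,v_i}(s') \quad \text{(by Eq. (2.7))} \\
&= P_{pa_i}(v_i) \sum_{s'} P_{pa_i,s,v_i}(s') \quad \text{(by Eq. (2.25))} \\
&= P_{pa_i}(v_i)
\end{aligned} \qquad (2.36)$$

□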


Property 6

$$P_{pa_i}(v_i) = P(v_i \mid pa_i). \qquad (2.37)$$

Property 6 has been given in Eq. (2.29).

Property 7 For any set of variables $S \subseteq V$, and $V_i \notin S$,

$$P_s(v_i \mid pa_i) = P(v_i \mid pa_i), \quad \text{for } pa_i \text{ consistent with } s. \qquad (2.38)$$
Proof: Let $S' = V \setminus (PA_i \cup \{V_i\} \cup S)$. Assuming that $pa_i$
is consistent with s, we have

$$\begin{aligned}
P_s(v_i, pa_i) &= \sum_{s'} P_s(v_i, pa_i, s') \\
&= \sum_{s'} P_{v \setminus \{v_i\}}(v_i)\, P_{s,v_i}(pa_i, s') \quad \text{(by Eq. (2.7))} \\
&= P(v_i \mid pa_i) \sum_{s'} P_{s,v_i}(pa_i, s') \quad \text{(by Eq. (2.14))} \\
&= P(v_i \mid pa_i)\, P_{s,v_i}(pa_i) \\
&= P(v_i \mid pa_i)\, P_s(pa_i) \quad \text{(by Property 4)}
\end{aligned} \qquad (2.39)$$

which leads to Eq. (2.38). □


2.3 Interventional Distributions in Semi-Markovian Models

When some variables in a Markovian model are unobserved, the probability
distribution over the observed variables may no longer be decomposed as in
Eq. (1.1). Let $V = \{V_1, \ldots, V_n\}$ and $U = \{U_1, \ldots, U_{n'}\}$
stand for the sets of observed and unobserved variables respectively. In a
semi-Markovian model, as defined in Section 1.3, the observed probability
distribution and the post-intervention distribution are given by Eqs. (1.5) and
(1.6) respectively.

If, in a semi-Markovian model, no U variable is an ancestor of more than
one V variable, then $P_t(v)$ in Eq. (1.6) factorizes into a product as in
Eq. (1.2), regardless of the parameters $\{P(v_i \mid pa_i, u^i)\}$ and
$\{P(u_i)\}$. Therefore, for such a model, the causal Markov condition holds
relative to $G_V$ (the subgraph of G composed only of V variables), that is,
each variable $V_i$ is independent of all its non-descendants given its parents
$PA_i$ in $G_V$. By convention, the U variables are usually not shown
explicitly, and $G_V$ is called the causal graph of the model.

The  causal  Markov  condition  is  often  assumed  as  an  inherent  feature  of  causal 
models  (see  e.g.  [KSC84,  SGS93]).  It  reflects  our  two  basic  causal  assumptions: 
(i) include in the model every variable that is a cause of two or more other
variables in the model; and (ii) Reichenbach’s (1956) common-cause assumption, also
known  as  “no  correlation  without  causation,”  stating  that,  if  any  two  variables 
are  dependent,  then  one  is  a  cause  of  the  other  or  there  is  a  third  variable  causing 
both. 

If two or more variables in $V$ are affected by unobserved confounders, the
presence of such confounders would not permit the decomposition in Eq. (1.1),
and, in general, $P(v)$ generated by a semi-Markovian model is a mixture of products
given in (1.5). However, the conditional distribution $P(v \mid u)$ factorizes into
a product

$$P(v \mid u) = \prod_i P(v_i \mid pa_i, u^i), \qquad (2.40)$$

and we also have

$$P_t(v \mid u) = \begin{cases} \prod_{\{i \mid V_i \notin T\}} P(v_i \mid pa_i, u^i) & \text{for all } v \text{ consistent with } T = t, \\ 0 & \text{for all } v \text{ inconsistent with } T = t. \end{cases} \qquad (2.41)$$

Therefore all Properties 1-7 hold when we condition on $u$. For example, the
Markov property can be written as

$$P_{v \setminus (s_1 \cup s_2)}(s_1, s_2 \mid u) = P_{v \setminus s_1}(s_1 \mid u)\, P_{v \setminus s_2}(s_2 \mid u). \qquad (2.42)$$

Let $P_*(u)$ denote the set of all conditional interventional distributions

$$P_*(u) = \{P_t(v \mid u) \mid T \subseteq V,\ t \in Dm(T)\}. \qquad (2.43)$$

Then $P_*(u)$ is fully characterized by the three properties effectiveness, Markov,
and recursiveness, conditioning on $u$.

Let $P_*$ denote the set of all interventional distributions over observed variables
$V$ as in (2.1). From the properties of the $P_*(u)$ set, we can immediately conclude
that the $P_*$ set satisfies the following properties: effectiveness (Property 1),
recursiveness (Property 3), Property 4, and Property 5, while Markov (Property 2),
Property 6, and Property 7 do not hold. For example, Property 5 can be proved
from its conditional version,

$$P_{pa_i, s}(v_i \mid u) = P_{pa_i}(v_i \mid u), \qquad (2.44)$$

as follows:

$$P_{pa_i, s}(v_i) = \sum_u P_{pa_i, s}(v_i \mid u) P(u) = \sum_u P_{pa_i}(v_i \mid u) P(u) = P_{pa_i}(v_i). \qquad (2.45)$$

Significantly, the $P_*$ set must satisfy inequalities that are unique to semi-Markovian
models, as opposed, for example, to models containing feedback loops.
For example, from Eq. (1.6), and using

$$P(v_i \mid pa_i, u^i) \le 1, \qquad (2.46)$$

we obtain the following property.

Property 8 For any three sets of variables, $T$, $S$, and $R$, we have

$$P_{tr}(s) \ge P_t(r, s) + P_r(t, s) - P(t, r, s). \qquad (2.47)$$

Additional inequalities, involving four or more subsets, can likewise be derived by
this method. However, finding a set of properties that can completely characterize
the $P_*$ set of a semi-Markovian causal model remains an open challenge.
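As a numerical sanity check of Property 8, the following minimal sketch builds a small semi-Markovian model and evaluates both sides of (2.47) for every value assignment. The graph (T -> R -> S, T -> S, with a single unobserved U pointing into all three), the random parameters, and all function names are assumptions made for illustration; they are not taken from the text.

```python
import itertools
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Assumed toy model: observed T -> R -> S and T -> S, plus one unobserved
# binary U that is a parent of all three. Every number is P(variable = 1 | ...).
p_u = random.random()
p_t = {u: random.random() for u in (0, 1)}
p_r = {(t, u): random.random() for t in (0, 1) for u in (0, 1)}
p_s = {(t, r, u): random.random()
       for t in (0, 1) for r in (0, 1) for u in (0, 1)}


def bern(p_one, value):
    """Probability of `value` under a Bernoulli with P(1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one


def interventional(t, r, s, do):
    """P_do(t, r, s): sum over u of the truncated product.

    `do` is the set of intervened variable names; their factors are dropped.
    """
    total = 0.0
    for u in (0, 1):
        term = bern(p_u, u) * bern(p_s[(t, r, u)], s)
        if "T" not in do:
            term *= bern(p_t[u], t)
        if "R" not in do:
            term *= bern(p_r[(t, u)], r)
        total += term
    return total


def min_slack():
    """Smallest value of LHS - RHS in (2.47) over all (t, r, s)."""
    slacks = []
    for t, r, s in itertools.product((0, 1), repeat=3):
        lhs = interventional(t, r, s, {"T", "R"})
        rhs = (interventional(t, r, s, {"T"})
               + interventional(t, r, s, {"R"})
               - interventional(t, r, s, set()))
        slacks.append(lhs - rhs)
    return min(slacks)


print(min_slack() >= -1e-12)
```

For this particular graph the slack, conditional on each value of u, equals P(s | t, r, u)(1 - P(t | u))(1 - P(r | t, u)), which is why the check relies only on each factor being at most 1, as in (2.46).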




2.4  Applications  in  the  Identification  of  Causal  Effects 

Given two disjoint sets $T$ and $S$, the causal effect $P_t(s)$ is said to be identifiable if,
given a causal graph, it can be determined uniquely from the distribution $P(v)$ of
the observed variables, and is thus independent of the unknown quantities, $P(u)$
and $P(v_i \mid pa_i, u^i)$, that involve elements of $U$. Identification means that we can
learn the effect of the action $T = t$ (on the variables in $S$) from sampled data
taken prior to actually performing that action. In Markovian models, all causal
effects are identifiable and are given in Eq. (1.2). When some confounders are unobserved,
the question of identifiability arises. Sufficient graphical conditions for
ensuring the identification of $P_t(s)$ in semi-Markovian models were established by
several authors [SGS93, Pea93, Pea95a] and are summarized in [Pea00, Chapters
3 and 4]. Since

$$P_t(s) = \sum_u P_t(s \mid u) P(u), \qquad (2.48)$$

and  since  we  have  a  complete  characterization  over  the  set  of  conditional  inter¬ 
ventional  distributions  (P*(u)),  we  can  use  Properties  1-3  (conditioning  on  u)  for 
identifying  causal  effects  in  semi-Markovian  models. 

The  assumptions  embodied  in  the  causal  graph  can  be  translated  into  the 
language  of  conditional  interventional  distributions  as  follows: 

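For each variable $V_i$,

$$P_{v \setminus \{v_i\}}(v_i \mid u) = P_{pa_i}(v_i \mid u^i). \qquad (2.49)$$

The Markov property (2.8) conditioning on $u$ then becomes

$$P_t(v \setminus t \mid u) = \prod_{\{i \mid V_i \in V \setminus T\}} P_{pa_i}(v_i \mid u^i). \qquad (2.50)$$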

The significance of Eq. (2.50) rests in simplifying the derivation of elaborate
causal effects in semi-Markovian models. To illustrate this derivation, consider the
model in Figure 1.2, and assume we need to derive the causal effect of $X$ on
$\{Z, Y\}$, a task analyzed in [Pea00, pp. 86-8] using do-calculus. Applying (2.50) to
$P_x(y, z \mid u)$ (with $x$ replacing $t$), we obtain:

$$\begin{aligned} P_x(y, z) &= \sum_u P_x(y, z \mid u) P(u) \\ &= \sum_u P_z(y \mid u) P_x(z) P(u) \\ &= P_x(z) P_z(y). \end{aligned} \qquad (2.51)$$

Each of these two factors can be derived by simple means: $P_x(z) = P(z \mid x)$ because
$Z$ has no unobserved parent, and $P_z(y) = \sum_{x'} P(y \mid x', z) P(x')$ because $X$ blocks
all back-door paths from $Z$ to $Y$ (they can also be derived by applying (2.50) to
$P(x, y, z \mid u)$). As a result, we immediately obtain the desired quantity:

$$P_x(y, z) = P(z \mid x) \sum_{x'} P(y \mid x', z) P(x'), \qquad (2.52)$$

a result that required many steps in do-calculus.
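Eq. (2.52) can be checked numerically on a small example. The sketch below assumes a graph consistent with the description above (U -> X, X -> Z, Z -> Y, U -> Y, with U unobserved) and hypothetical binary parameters; the true effect is computed with access to U, and the right-hand side of (2.52) is computed from the observed joint alone. The function names are illustrative.

```python
import itertools

# Assumed structure (consistent with the text): U -> X, X -> Z, Z -> Y, U -> Y,
# with U unobserved. All numbers are hypothetical; each is P(variable = 1 | ...).
P_U1 = 0.4
P_X1 = {0: 0.2, 1: 0.7}                                      # keyed by u
P_Z1 = {0: 0.3, 1: 0.8}                                      # keyed by x
P_Y1 = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.9}  # keyed by (z, u)


def bern(p_one, value):
    """Probability of `value` under a Bernoulli with P(1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one


def observed_joint():
    """P(x, z, y): the only distribution an analyst would get to see."""
    joint = {}
    for x, z, y in itertools.product((0, 1), repeat=3):
        joint[(x, z, y)] = sum(
            bern(P_U1, u) * bern(P_X1[u], x) * bern(P_Z1[x], z)
            * bern(P_Y1[(z, u)], y)
            for u in (0, 1))
    return joint


def true_effect(x, z, y):
    """P_x(y, z) computed with access to U (truncated factorization)."""
    return sum(bern(P_U1, u) * bern(P_Z1[x], z) * bern(P_Y1[(z, u)], y)
               for u in (0, 1))


def effect_from_observations(joint, x, z, y):
    """Right-hand side of Eq. (2.52), using the observed joint only."""
    def p_x(xv):
        return sum(p for (a, _, _), p in joint.items() if a == xv)

    def p_xz(xv, zv):
        return joint[(xv, zv, 0)] + joint[(xv, zv, 1)]

    p_z_given_x = p_xz(x, z) / p_x(x)
    adjusted = sum(joint[(xp, z, y)] / p_xz(xp, z) * p_x(xp) for xp in (0, 1))
    return p_z_given_x * adjusted


JOINT = observed_joint()
max_gap = max(
    abs(true_effect(x, z, y) - effect_from_observations(JOINT, x, z, y))
    for x, z, y in itertools.product((0, 1), repeat=3))
print(max_gap < 1e-12)
```

The two computations agree up to floating-point error because, in this graph, Z is independent of U given X, so averaging P(y | x', z) over P(x') recovers the average of P(y | z, u) over P(u).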

In general, from (2.50), we have

$$P_t(v \setminus t) = \sum_u \prod_{\{i \mid V_i \in V \setminus T\}} P_{pa_i}(v_i \mid u^i) P(u). \qquad (2.53)$$

Depending on the causal graph, the right-hand side of (2.53) may sometimes be
decomposed into a product of summations as

$$\begin{aligned} P_t(v \setminus t) &= \prod_j \sum_{n_j} \prod_{V_i \in S_j} P_{pa_i}(v_i \mid u^i) P(n_j) \\ &= \prod_j P_{v \setminus s_j}(s_j), \end{aligned} \qquad (2.54)$$

where the $N_j$'s form a partition of $U$ and the $S_j$'s form a partition of $V \setminus T$. Eq. (2.51)
is an example of such a decomposition. Therefore the problem of identifying
$P_t(v \setminus t)$ is reduced to identifying some $P_{v \setminus s_j}(s_j)$'s. Based on this decomposition,
a method for systematically identifying causal effects is developed in Chapter 5.

2.5  Conclusion 

We  have  shown  that  all  experimental  results  obtained  from  an  underlying  Marko¬ 
vian  causal  model  are  fully  characterized  by  three  norms  of  coherence:  Effective¬ 
ness,  Markov,  and  Recursiveness.  We  have  further  demonstrated  the  use  of  these 
norms  as  inferential  tools  for  identifying  causal  effects  in  semi-Markovian  models. 
This  permits  one  to  predict  the  effects  of  actions  and  policies,  in  the  presence  of 
unmeasured  variables,  from  data  obtained  prior  to  performing  those  actions  and 
policies. 

The  key  element  in  our  characterization  of  experimental  distributions  is  the 
generic  formulation  of  the  Markov  property  (2.7)  as  a  relationship  among  three 
experimental  distributions,  instead  of  the  usual  formulation  as  a  relationship 
between  a  distribution  and  a  graph  (as  in  (1.1)).  The  practical  implication  of 
this formulation is that violations of the Markov property can be detected without
knowledge of the underlying causal graph; comparing distributions from just
three experiments, $P_{v \setminus (s_1 \cup s_2)}(s_1, s_2)$, $P_{v \setminus s_1}(s_1)$, and $P_{v \setminus s_2}(s_2)$, may reveal such
violations, and should allow us to conclude, prior to knowing the structure of $G$, that
the underlying data-generation process is non-Markovian. Alternatively, if our
confidence in the Markovian nature of the data-generation process is unassailable,
such a violation would imply that the three experiments were not conducted on
the same population, under the same conditions, or that the experimental interventions
involved had side effects and were not properly confined to the specified
sets $S_1$, $S_2$, and $S_1 \cup S_2$.

This feature is useful in efforts designed to infer the structure of $G$ from a
combination of observational and experimental data; a single violation of (2.7)
suffices to reveal that unmeasured confounders exist between variables in $S_1$ and
those in $S_2$. Likewise, a violation of any inequality in (2.47) would imply that
the underlying model is not semi-Markovian; this means that feedback loops may
operate in the data-generating process, or that the interventions in the experiments
are not “atomic”.

CHAPTER 3


Causal  Discovery  from  Changes 

3.1  Introduction 

Inferring  causal  structures  from  empirical  data  has  become  an  active  research 
area  in  recent  years.  Several  graph-based  algorithms  have  been  developed  for 
this  purpose.  Some  are  based  on  detecting  patterns  of  conditional  independence 
relationships  [PV91,  SGS93],  and  some  are  based  on  Bayesian  approaches  [CH92, 
Gei95, Coo99]. These discovery methods assume a static environment, that is,
a  time-invariant  distribution  and  a  time-invariant  data-generating  model,  and 
attempt  to  infer  structures  that  encode  dynamic  aspects  of  the  environment, 
for  example,  how  probabilities  would  change  as  a  result  of  interventions.  This 
transition,  from  static  to  dynamic  information,  constitutes  a  major  inferential 
leap,  and  is  severely  limited  by  the  inherent  indistinguishability  (or  equivalence) 
relation  that  governs  Bayesian  networks  [VP90]. 

One  way  of  overcoming  this  basic  limitation  is  to  augment  the  data  with 
partial  causal  knowledge,  if  such  is  available.  [SGS93],  for  example,  discussed 
the  use  of  experimental  data  to  identify  causal  relationships.  [CY99]  discussed  a 
Bayesian  method  of  causal  discovery  from  a  mixture  of  observational  and  exper¬ 
imental  data. 

We  propose  a  new  method  of  discovering  causal  relations  in  data,  based  on 
the  detection  and  interpretation  of  local  spontaneous  changes  in  the  environment. 
While  previous  methods  assume  that  data  are  generated  by  a  static  statistical 
distribution,  our  proposal  aims  at  exploiting  dynamic  changes  in  that  distribu¬ 
tion.  Such  changes  are  always  present  in  any  realistic  domain  that  is  embedded 
in  a  larger  background  of  dynamically  changing  conditions.  For  example,  natural 
disasters,  armed  conflicts,  epidemics,  labor  disputes,  and  even  mundane  decisions 
by  other  agents,  are  unexpected  eventualities  that  are  not  naturally  captured  in 
distribution functions. The occurrence of such eventualities tends to alter the
distribution under study and yield changes that are markedly different from ordinary
statistical fluctuations. Whereas static analysis views these changes as
a nuisance, and attempts to adjust and compensate for them, we will view them as
a valuable source of information about the data-generating process. A controlled
experimental study may be thought of as a special case of these environmental
changes, where the external influence involves fixing a designated variable to some
predetermined  value.  In  general,  however,  the  external  influence  may  be  milder, 
merely  changing  the  conditional  probability  of  a  variable,  given  its  causes.  More¬ 
over,  in  marked  contrast  to  controlled  experiments,  we  may  not  know  in  advance 
the  nature  of  the  change,  its  location,  or  even  whether  it  took  place;  these  may 
need  to  be  inferred  from  the  data  itself. 

The  basic  idea  has  its  roots  in  the  economic  literature.  The  economist  Kevin 
Hoover  (1990)  attempted  to  infer  the  direction  of  causal  influences  among  eco¬ 
nomic  variables  (e.g.,  employment  and  money  supply)  by  observing  the  changes 
that  sudden  modifications  in  the  economy  (e.g.,  tax  reform,  labor  dispute)  in¬ 
duced  in  the  statistics  of  these  variables.  Hoover  assumed  that  the  conditional 
probability  of  an  effect  given  its  causes  remains  invariant  to  changes  in  the  mech¬ 
anism  that  generates  the  cause,  while  the  conditional  probability  of  a  cause  given 
the  effect  would  not  remain  invariant  under  such  changes.  This  asymmetry  may 
be  useful  in  distinguishing  cause  and  effect. 

Today  we  understand  more  precisely  the  conditions  under  which  such  asym¬ 
metries  would  prevail  and  how  to  interpret  such  asymmetries  in  the  context  of 
large,  multi-variate  systems.  Whenever  we  obtain  reliable  information  (e.g.,  from 
historical  or  institutional  knowledge)  that  an  abrupt  local  change  has  taken  place 
in  a  specific  mechanism  that  constrains  a  given  family  of  variables,  we  can  use 
the  observed  changes  in  the  marginal  and  conditional  probabilities  surrounding 
those  variables  to  determine  the  direction  of  causal  influences  in  the  domain. 
The  statistical  features  that  remain  invariant  under  such  changes,  as  well  as  the 
causal  assumptions  underlying  this  invariance,  are  encoded  in  the  causal  graph  at 
hand,  and  can  be  used  therefore  for  testing  the  validity  of  a  given  structure.  Like¬ 
wise,  conflicts  between  observed  and  predicted  changes  can  be  used  for  automatic 
restructuring  of  the  topology  of  the  structure  at  hand. 

In  this  chapter,  we  will  assume  that  we  have  data  generated  from  a  dynami¬ 
cally  changing  environment  and  our  task  is  to  recover  the  actual  causal  structures. 
In  Section  3.2,  we  formally  present  this  learning  problem.  In  Section  3.3,  we  an¬ 
alyze  the  equivalence  classes  of  causal  structures  relative  to  the  given  data.  In 
Section  3.4,  we  analyze  the  patterns  of  distributional  changes  induced  by  data 
and  present  recovery  methods  that  infer  causal  directionality  information  from 
those  changes.  In  Section  3.5,  we  investigate  the  Bayesian  approach  for  causal 
discovery.  The  Bayesian  approach  [HMC97]  gives  us  a  consistent  way  of  combin¬ 
ing  dynamic  datasets  to  get  an  overall  estimation  of  causal  structures.  We  show 
how  to  derive  a  Bayesian  scoring  metric  from  various  types  of  dynamic  data  by 
assigning  appropriate  priors  over  probability  parameters.  The  Bayesian  scores 
obtained  are  extensions  of  previously  derived  Bayesian  scores  [CH92,  Gei95].  For 
mixed observational and experimental data we obtained the same score as given
in [CY99]. We show that dynamic data increase our power of causal discovery
beyond  the  limits  set  by  independence  equivalence. 

3.2  Mechanism  Changes 

Let our problem domain be a set of discrete random variables $V = \{V_1, \ldots, V_n\}$.
In this chapter, we denote a causal model over $V$ by a pair $M = \langle G, \Theta_G \rangle$, where
$G$ is the causal graph and $\Theta_G$ is a set of probability parameters. We assume that
each variable $V_i$ can take values from a finite domain, $Dm(V_i) = \{v_{i1}, \ldots, v_{ir_i}\}$,
where $r_i$ is the number of states of $V_i$. Let $\theta_{v_i; pa_i}$, $v_i \in Dm(V_i)$, $pa_i \in Dm(Pa_i)$
denote the multinomial parameter corresponding to the conditional probability
$P(v_i \mid pa_i)$. We will use the following notations: $\theta_{pa_i} = \{\theta_{v_i; pa_i} \mid v_i \in Dm(V_i)\}$,
$\Psi_i = \bigcup_{pa_i \in Dm(Pa_i)} \theta_{pa_i}$, $\Theta_G = \bigcup_{i=1}^{n} \Psi_i$. A causal model $M = \langle G, \Theta_G \rangle$ generates a
probability distribution given in Eq. (1.1), rewritten as

$$P(v) = \prod_i \theta_{v_i; pa_i}. \qquad (3.1)$$

A probability distribution $P(V)$ is said to be compatible with a causal graph $G$
if $P(V)$ can be generated by some causal model $M = \langle G, \Theta_G \rangle$.

Based  on  the  modularity  assumption  that  each  family  in  the  causal  graph 
represents  an  autonomous  physical  mechanism  and  is  subjected  to  change  without 
influencing  other  mechanisms,  we  formally  define  mechanism  change  as  follows. 

Definition 3 (Mechanism Change) A mechanism change to a causal model
$M = \langle G, \Theta_G \rangle$ at a variable $V_i$ is a transformation of $M$ that produces a new
model, $M_{V_i} = \langle G, \Theta'_G \rangle$, where $\Theta'_G = \Psi'_i \cup (\Theta_G \setminus \Psi_i)$ and $\Psi'_i$ is a set of parameters
having values that differ from those in $\Psi_i$.

We will assume that the parent set $Pa_i$ does not change in a mechanism change.
We see that an intervention that fixes $V_i$ to a particular value is a special case
of a mechanism change. Let $P(V)$ be the distribution generated by $M$, as in
Eq. (3.1). Then the distribution generated by $M_{V_i}$ is given by

$$P_{V_i}(v) = \theta'_{v_i; pa_i} \prod_{j \ne i} \theta_{v_j; pa_j}. \qquad (3.2)$$

We will call $(P, P_{V_i})$ a transition pair (TP) and $V_i$ the focal variable of the transition.
Assume that a series of mechanism changes occurred successively to a
causal model $M = \langle G, \Theta_G \rangle$, and let $F = (V_{i_1}, \ldots, V_{i_k})$ denote the corresponding
sequence of focal variables. We use $P_{TS} = (P^0, P^1, \ldots, P^k)$ to denote the
sequence of distributions generated by such a series, and call the pair $(P_{TS}, F)$ a
transition sequence (TS).
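A mechanism change and the transition pair it produces can be illustrated with a minimal sketch. The encoding of a model as a dictionary of conditional probability tables, the two-variable graph X -> Y, and all numbers are assumptions chosen for illustration. The sketch also shows the asymmetry that motivates the approach: after a change at X, the conditional P(Y | X) is unchanged while P(X | Y) is not.

```python
import itertools

# Assumed encoding: a model maps each binary variable to (parents, cpt), where
# cpt maps a tuple of parent values to P(variable = 1 | parents).
def joint(model, order):
    """Joint distribution of Eq. (3.1) as a dict keyed by value tuples."""
    dist = {}
    for values in itertools.product((0, 1), repeat=len(order)):
        assignment = dict(zip(order, values))
        prob = 1.0
        for var in order:
            parents, cpt = model[var]
            p_one = cpt[tuple(assignment[p] for p in parents)]
            prob *= p_one if assignment[var] == 1 else 1.0 - p_one
        dist[values] = prob
    return dist


def mechanism_change(model, var, new_cpt):
    """Return a new model in which only the parameters of `var` are replaced."""
    parents, old_cpt = model[var]
    if set(new_cpt) != set(old_cpt):
        raise ValueError("the parent set must not change in a mechanism change")
    changed = dict(model)          # all other mechanisms are shared unchanged
    changed[var] = (parents, dict(new_cpt))
    return changed


def conditional(dist, order, target, given):
    """P(target = 1 | given = g) for g in (0, 1); both are single variables."""
    ti, gi = order.index(target), order.index(given)
    result = {}
    for g in (0, 1):
        denom = sum(p for v, p in dist.items() if v[gi] == g)
        numer = sum(p for v, p in dist.items() if v[gi] == g and v[ti] == 1)
        result[g] = numer / denom
    return result


ORDER = ("X", "Y")
M = {"X": ((), {(): 0.3}), "Y": (("X",), {(0,): 0.2, (1,): 0.9})}
M_X = mechanism_change(M, "X", {(): 0.7})      # focal variable: X
P, P_X = joint(M, ORDER), joint(M_X, ORDER)    # the transition pair (P, P_X)

print(conditional(P, ORDER, "Y", "X"), conditional(P_X, ORDER, "Y", "X"))
print(conditional(P, ORDER, "X", "Y"), conditional(P_X, ORDER, "X", "Y"))
```

Returning a new dictionary rather than editing the model in place keeps both members of the transition pair available for comparison.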

As oracles for cause-and-effect relations, causal models can predict the effects
that any external or spontaneous changes have on the distributions. Conversely,
by detecting how probability distributions change under various mechanism
changes, we obtain information on the structure of the model generating
those distributions. We propose to exploit the stream of distributions from
mechanism changes to recover underlying causal structures. In this chapter,
we make the following assumptions: each mechanism change occurs at one single
variable at a time, and we have the distribution (or samples thereof) after
each single mechanism change, that is, we know when each mechanism change
happens and at which variable. We will then assume that we are given a TS
$(P_{TS}, F)$ corresponding to some causal graph $G$, or that we have a sequence of datasets
$D_{TS} = \{D^0, \ldots, D^k\}$, where each $D^j$ is a set of random samples from a distribution
$P^j$, such that each pair $(P^{j-1}, P^j)$ is a TP with focal variable $V_{i_j}$, and our
task will be to recover a causal graph (or a set of graphs) that can generate $D_{TS}$.
First, we study what can be learned from a TS.


3.3  Indistinguishability  of  Causal  Graphs 

Our ability to recover causal graphs is limited by the statistical indistinguishability
of causal models with given data. In this section, we study the classes of
causal  structures  that  are  indistinguishable  (or  “equivalent”)  relative  to  a  TS. 

The  statistical  information  provided  by  any  causal  graph  is  completely  en¬ 
coded  in  the  independence  relationships  among  the  variables.  Therefore,  two 
causal  graphs  are  statistically  indistinguishable  given  one  static  distribution  if 
and  only  if  they  are  independence  equivalent.  The  graphical  conditions  for  inde¬ 
pendence  equivalence  are  given  by  the  following  theorem. 

Theorem 3 (Independence Equivalence) Two causal graphs are independence
equivalent if and only if they have the same skeletons and the same sets of v-structures,
that is, two converging arrows whose tails are not connected by an
arrow [VP90].

Now assume that we have a TP with focal variable $V_i$. A causal graph $G$
is said to be compatible with a transition pair $(P, P_{V_i})$ if $P$ can be generated
by a causal model $M = \langle G, \Theta_G \rangle$ and $P_{V_i}$ can be generated by a causal model
$M_{V_i} = \langle G, \Theta'_G \rangle$ resulting from a mechanism change to $M$ at $V_i$. Note that a
causal graph could be compatible with both $P$ and $P_{V_i}$ but not compatible with
the TP $(P, P_{V_i})$. Among those independence-equivalent graphs compatible with
both $P$ and $P_{V_i}$, a TP $(P, P_{V_i})$ can distinguish those that can generate $P_{V_i}$ from
$P$ with a single mechanism change from those that cannot. Two causal graphs
$G_1$ and $G_2$ are called transition pair equivalent with respect to a TP with focal
variable $V_i$, or $V_i$-transition equivalent, if every TP $(P, P_{V_i})$ compatible with $G_1$
is also compatible with $G_2$. Two causal graphs are statistically indistinguishable
given a TP $(P, P_{V_i})$ if and only if they are $V_i$-transition equivalent.


Theorem 4 (Transition Pair Equivalence) Two causal graphs $G_1$ and $G_2$
are $V_i$-transition equivalent if and only if they have the same skeletons, the same
sets of v-structures, and the same sets of parents for $V_i$.


Proof: Let $G_1$ be compatible with a TP $(P, P_{V_i})$. $G_2$ must have the same skeletons
and the same sets of v-structures as $G_1$ to be compatible with $P$ (and $P_{V_i}$) by
Theorem 3. We have the following decomposition:

$$P(v) = P(v_i \mid pa_i^1) \prod_{j \ne i} P(v_j \mid pa_j^1) = P(v_i \mid pa_i^2) \prod_{j \ne i} P(v_j \mid pa_j^2), \qquad (3.3)$$

where $Pa_j^1$ and $Pa_j^2$ are the parents of $V_j$ in $G_1$ and $G_2$ respectively. $G_1$ is compatible
with the TP $(P, P_{V_i})$, hence can generate $P_{V_i}$ from $P$ by a mechanism change at
$V_i$:

$$P_{V_i}(v) = P_{V_i}(v_i \mid pa_i^1) \prod_{j \ne i} P(v_j \mid pa_j^1). \qquad (3.4)$$

Plugging the expression for $\prod_{j \ne i} P(v_j \mid pa_j^1)$ from Eq. (3.3) into Eq. (3.4), we have

$$P_{V_i}(v) = P_{V_i}(v_i \mid pa_i^1) \frac{P(v_i \mid pa_i^2)}{P(v_i \mid pa_i^1)} \prod_{j \ne i} P(v_j \mid pa_j^2). \qquad (3.5)$$

$G_2$ is also compatible with the transition pair $(P, P_{V_i})$ if and only if

$$P_{V_i}(v) = P_{V_i}(v_i \mid pa_i^2) \prod_{j \ne i} P(v_j \mid pa_j^2). \qquad (3.6)$$

Eqs. (3.5) and (3.6) lead to

$$P_{V_i}(v_i \mid pa_i^1) \frac{P(v_i \mid pa_i^2)}{P(v_i \mid pa_i^1)} = P_{V_i}(v_i \mid pa_i^2), \qquad (3.7)$$

which holds for any distributions $P$ and $P_{V_i}$ if and only if $G_1$ has the same parent
set for $V_i$ as $G_2$ ($Pa_i^1 = Pa_i^2$); if $G_1$ has a different parent set for $V_i$ from $G_2$,
Eq. (3.7) will impose some constraints between $P$ and $P_{V_i}$, and will not hold for
an arbitrary transition pair $(P, P_{V_i})$. □

Figure 3.1: (a) The Cancer network; (a)-(d) are independence equivalent; (e)-(g)
are B-transition equivalent. A mechanism change on A determines a unique
causal graph (h).

A TS is simply a series of TP's. Accordingly, we say that a causal graph is
compatible with a transition sequence $P_{TS} = (P^0, P^1, \ldots, P^k)$, $F = (V_{i_1}, \ldots, V_{i_k})$
if it is compatible with each TP $(P^{j-1}, P^j)$ in the sequence. Likewise, two causal
graphs $G_1$ and $G_2$ are called transition sequence equivalent with respect to a TS
$(P_{TS}, F)$, or F-transition equivalent, if every TS $(P_{TS}, F)$ compatible with $G_1$ is
also compatible with $G_2$. Two causal graphs are statistically indistinguishable
given a TS $(P_{TS}, F)$ if and only if they are F-transition equivalent.

Theorem 5 (Transition Sequence Equivalence) Two causal graphs are F-transition
equivalent if and only if they have the same skeletons, the same sets of
v-structures, and the same sets of parents for variables in F.

Theorem  5  says  that  a  TS  determines  the  directions  of  the  edges  between  the 
focal  variables  and  their  neighbors  (among  the  set  of  independence-equivalent 
graphs).  See  Figure  3.1  for  an  example  of  TS  equivalence. 
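The graphical criteria of Theorems 3 and 5 translate directly into a short check. The sketch below assumes a DAG is encoded as a dictionary mapping each variable to the set of its parents; this encoding and the function names are illustrative choices, not part of the original presentation.

```python
from itertools import combinations

# Assumed encoding: a graph maps every variable to the set of its parents.
def skeleton(graph):
    """Set of undirected edges."""
    return {frozenset((p, c)) for c, parents in graph.items() for p in parents}


def v_structures(graph):
    """Pairs of non-adjacent parents converging on a common child."""
    skel = skeleton(graph)
    found = set()
    for child, parents in graph.items():
        for a, b in combinations(sorted(parents), 2):
            if frozenset((a, b)) not in skel:
                found.add((frozenset((a, b)), child))
    return found


def independence_equivalent(g1, g2):
    """Criterion of Theorem 3: same skeleton and same v-structures."""
    return skeleton(g1) == skeleton(g2) and v_structures(g1) == v_structures(g2)


def transition_equivalent(g1, g2, focal):
    """Criterion of Theorem 5: additionally, same parents for focal variables."""
    return (independence_equivalent(g1, g2)
            and all(g1[v] == g2[v] for v in focal))


chain = {"A": set(), "B": {"A"}, "C": {"B"}}          # A -> B -> C
reverse = {"A": {"B"}, "B": {"C"}, "C": set()}        # A <- B <- C
fork = {"A": {"B"}, "B": set(), "C": {"B"}}           # A <- B -> C
collider = {"A": set(), "B": {"A", "C"}, "C": set()}  # A -> B <- C

print(independence_equivalent(chain, reverse))        # same skeleton, no v-structures
print(transition_equivalent(chain, reverse, ["B"]))   # parents of B differ
print(transition_equivalent(chain, fork, ["C"]))      # parents of C agree
```

In this example a mechanism change at B separates the chain from its reversal, whereas a change at C alone does not separate the chain from the fork.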

Given  a  TS,  the  most  we  can  expect  to  recover  is  a  set  of  causal  graphs  that 
are  TS-equivalent,  as  defined  by  Theorem  5.  We  may  find  this  equivalence  class 
by  detecting  independence  relations  and  distribution  changes. 

3.4  Learning  Causation  by  Detecting  Changes 

In  this  section,  we  identify  the  causal  information  that  can  be  learned  by  detecting 
various  changes  in  the  probability  distributions,  in  particular,  changes  in  the 
marginal  probability  of  each  variable.  The  following  theorem  is  obvious. 

Theorem 6 A mechanism change at a variable $X$ to a causal model
$M = \langle G, \Theta_G \rangle$ may alter the marginal probabilities of the descendants of $X$ in $G$ and
cannot alter the marginals of nondescendants of $X$.

It  is  possible  of  course  that,  for  some  peculiar  parameter  changes,  the  marginal 
probabilities  of  some  descendants  of  X  would  not  change.  When  recovering  causal 
information  from  distributional  changes,  we  assume  a  restriction  on  a  TS  called 
influentiality. 

Definition 4 (Influentiality) A TP $(P, P_X)$ generated by a causal model
$\langle G, \Theta_G \rangle$ is said to be influential if for every descendant $Y$ of $X$ in $G$, the
marginal distribution $P_X(Y)$ is different from $P(Y)$. A TS is influential if every
TP in the sequence is influential.

Assuming influentiality, we can obtain causal information by detecting changes
of marginal probabilities.
Given a TP $(P, P_X)$, and assuming that we can test each variable for marginal
distribution  change,  we  can  draw  the  following  inferences.  If  the  marginal  of  a 
variable  Y  has  changed,  we  conclude  that  Y  is  a  descendant  of  X.  If  the  marginal 
of  a  variable  Z  has  not  changed,  we  conclude  that  Z  is  a  nondescendant  of  X. 
We  thus  conclude  that  Z  <  X  <  Y  should  be  a  causal  order  consistent  with  the 
causal  graph.  Next  we  discuss  how  to  piece  together  ordering  information  of  this 
kind,  as  obtained  from  a  TS. 
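The inference from a single transition pair can be sketched as follows. The example assumes that exact marginals are available, so a numerical tolerance stands in for the statistical test that sampled data would require; the chain Z -> X -> Y and its numbers are hypothetical.

```python
def classify_by_marginal_change(before, after, focal, tol=1e-9):
    """Split non-focal variables into descendants and nondescendants of `focal`.

    `before` and `after` map each variable to its marginal, given as a dict
    from value to probability. Influentiality is assumed, so a changed
    marginal is read as "descendant" and an unchanged one as "nondescendant".
    """
    descendants, nondescendants = [], []
    for var in before:
        if var == focal:
            continue
        changed = any(abs(before[var][k] - after[var][k]) > tol
                      for k in before[var])
        (descendants if changed else nondescendants).append(var)
    return descendants, nondescendants


# Hypothetical marginals for a chain Z -> X -> Y before and after a change at X.
before = {"Z": {0: 0.5, 1: 0.5}, "X": {0: 0.5, 1: 0.5}, "Y": {0: 0.6, 1: 0.4}}
after = {"Z": {0: 0.5, 1: 0.5}, "X": {0: 0.25, 1: 0.75}, "Y": {0: 0.45, 1: 0.55}}

desc, nondesc = classify_by_marginal_change(before, after, focal="X")
print(nondesc, "< X <", desc)
```

With finite samples the fixed tolerance would have to be replaced by a test for a change in distribution, and the classification would then be subject to the errors discussed in Section 3.4.2.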

3.4.1  Partitioning  the  variables 

Given a TS $P_{TS}$, $F = (V_{i_1}, \ldots, V_{i_k})$, each variable can be characterized by a
sequence of 1's and 0's, a tag $a_1 \ldots a_k$, where $a_i$ reflects whether the marginal of
that variable changed ($a_i = 1$) or not ($a_i = 0$) in the $i$th transition of the sequence.
Non-focal variables that are given the same tags cannot be distinguished by the
TS (through detecting marginal changes), and no information can therefore be
extracted about their relative causal order in the causal graph. We may put
all such variables into a bucket labeled with the same tag, denoted by $B_{a_1 \ldots a_k}$.
Clearly, since we have no information on causal relations among variables within
the same bucket, all variables in a bucket stand in the same ordering relation to
all variables in another bucket. Focal variables need special treatment since they
carry more information, and we will put each focal variable into an individual
bucket called a focal bucket, denoted by $B^f_{a_1 \ldots a_k}$.

We  classify  variables  into  buckets  with  the  following  algorithm. 

Algorithm 1 (Partitioning Variables)

Input: a TS $P_{TS}$, $F = (V_{i_1}, \ldots, V_{i_k})$.

Output: A set of buckets, each associated with a tag $a_1 \ldots a_k$, and each containing
a set of variables.

Put all variables in a bucket $B$.

For the $i$th mechanism change, $i = 1, \ldots, k$,

    For each bucket $B_{a_1 \ldots a_{i-1}}$ including focal buckets

        if it contains the $i$th focal variable, put it in a focal bucket $B^f_{a_1 \ldots a_{i-1} 1}$.
        put other changing variables in $B_{a_1 \ldots a_{i-1} 1}$.
        put non-changing variables in $B_{a_1 \ldots a_{i-1} 0}$.

We show the partitioning process by an example. Assume that the actual
causal graph is the DAG shown in Figure 3.2(a) and that we are given a TS
$(P, P_X, P_Y)$. In the first transition, with $X$ as the focal variable, $P(Y)$ does
not change, hence $B_0 = \{Y\}$; $P(X)$, $P(Z)$, $P(W)$, $P(Q)$ do change, hence we
form $B_1 = \{Z, W, Q\}$, $B^f_1 = \{X\}$. Note that a focal variable is put into an
individual bucket. In the second transition, with $Y$ as the focal variable, $P(Y)$
changes, giving $B^f_{01} = \{Y\}$; $P(Z)$ and $P(W)$ change, giving $B_{11} = \{Z, W\}$;
$P(Q)$ and $P(X)$ do not change, giving $B_{10} = \{Q\}$ and $B^f_{10} = \{X\}$. As a result,
the variables are partitioned into four buckets: $B^f_{10} = \{X\}$, $B^f_{01} = \{Y\}$,
$B_{10} = \{Q\}$, $B_{11} = \{Z, W\}$.
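The partitioning procedure can be sketched in a few lines. The version below assumes that the outcome of the marginal-change tests is already available as one set of changed variables per transition, and it represents a bucket by a pair (tag, owner), where the owner is the focal variable of a focal bucket and `None` otherwise; both conventions are illustrative assumptions.

```python
def partition_variables(variables, focal_sequence, changed):
    """Iteratively split buckets, one mechanism change at a time.

    changed[i] is the set of variables whose marginal changed in transition i
    (0-based). Buckets are keyed by (tag, owner): owner is the focal variable
    for a focal bucket and None for an ordinary bucket.
    """
    buckets = {("", None): set(variables)}
    for i, focal in enumerate(focal_sequence):
        split = {}
        for (tag, owner), members in buckets.items():
            for var in members:
                bit = "1" if (var == focal or var in changed[i]) else "0"
                # A variable keeps its own focal bucket once it has been focal.
                new_owner = var if var == focal else owner
                split.setdefault((tag + bit, new_owner), set()).add(var)
        buckets = split
    return buckets


# The example from the text: TS (P, P_X, P_Y) over the variables of Figure 3.2(a).
result = partition_variables(
    variables=["X", "Y", "Z", "W", "Q"],
    focal_sequence=["X", "Y"],
    changed=[{"X", "Z", "W", "Q"}, {"Y", "Z", "W"}],
)
for (tag, owner), members in sorted(result.items(), key=lambda kv: kv[0][0]):
    print(tag, "focal" if owner else "     ", sorted(members))
```

Run on the change pattern described in the example, the sketch yields the same four buckets: focal buckets with tags 10 and 01 for X and Y, and ordinary buckets with tags 10 and 11 for {Q} and {Z, W}.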

3.4.2  Extracting  causal  information 

We shall now discuss what causal information we can extract from the tags attached
to buckets. Consider any two buckets $B_{a_1 \ldots a_k}$ and $B_{b_1 \ldots b_k}$. If there exists
a bit such that $a_i < b_i$ (i.e., $a_i = 0$ and $b_i = 1$), it must be that, in the $i$th transition,
the marginals of variables in $B_{a_1 \ldots a_k}$ did not change and the marginals
of variables in $B_{b_1 \ldots b_k}$ did. Therefore, no variable in $B_{a_1 \ldots a_k}$ is a descendant of
any variable in $B_{b_1 \ldots b_k}$. On the other hand, if there exists another bit such that
$a_j > b_j$ ($a_j = 1$, $b_j = 0$), then no variable in $B_{b_1 \ldots b_k}$ is a descendant of any variable
in $B_{a_1 \ldots a_k}$, which means that there exists no directed path, in particular no
edge, between any variable in $B_{a_1 \ldots a_k}$ and any variable in $B_{b_1 \ldots b_k}$. The equality
$a_i = b_i$, $i = 1, \ldots, k$ can only happen if one of the buckets is a focal bucket, in
which case the focal variable is an ancestor of all the variables in the other bucket.
In summary, the relation between two buckets $B_{a_1 \ldots a_k}$ and $B_{b_1 \ldots b_k}$ is determined
as follows:

R1 $a_i \le b_i$, $i = 1, \ldots, k$ and $\exists j$, $a_j < b_j$: variables in $B_{a_1 \ldots a_k}$ are nondescendants
of variables in $B_{b_1 \ldots b_k}$, denoted by $B_{a_1 \ldots a_k} < B_{b_1 \ldots b_k}$.

R2 $a_i \ge b_i$, $i = 1, \ldots, k$ and $\exists j$, $a_j > b_j$: $B_{b_1 \ldots b_k} < B_{a_1 \ldots a_k}$.

R3 There exist two bits $i \ne j$ such that $a_i < b_i$ and $a_j > b_j$: there can be no
directed path between any variable in $B_{a_1 \ldots a_k}$ and any variable in $B_{b_1 \ldots b_k}$.

R4 $a_i = b_i$, $i = 1, \ldots, k$, and one of the buckets, say $B^f_{a_1 \ldots a_k}$, is a focal bucket:
all variables in $B_{b_1 \ldots b_k}$ must be descendants of the focal variable in $B^f_{a_1 \ldots a_k}$,
which is a stronger relation than that in R1 and R2 but will still be denoted
by $B^f_{a_1 \ldots a_k} < B_{b_1 \ldots b_k}$.

The focal buckets convey more information. Let $B_{a_1 \ldots a_k}$ be a focal bucket containing
the focal variable $V_{i_j}$ for the $j$th transition. Then if $b_j = 1$, we have that all
variables in $B_{b_1 \ldots b_k}$ are descendants of $V_{i_j}$ since their marginals changed in the $j$th
transition. This rule is consistent with the above rules R1-R3, hence it is applied
only in R4 when R1-R3 cannot determine a relation. However, in practice, due
to imperfect statistical tests, there may be conflicts between them. For example,
we may determine that there is no edge between $B_{a_1 \ldots a_k}$ and $B_{b_1 \ldots b_k}$ by R3 and at
the same time $B_{a_1 \ldots a_k}$ is a focal bucket for the $j$th transition and $b_j = 1$. These
conflicts signal mistakes in the statistical tests, and whenever there are conflicts,
we will declare the relation as “unknown”. We summarize the above discussions
with the following algorithm.

Algorithm 2 (Extracting Relation)

Input: two buckets $B_{a_1 \ldots a_k}$ and $B_{b_1 \ldots b_k}$.

Output: the relation between the two buckets, which could be “<”, “no-directed-path (NDP)”, or “unknown”.

1. $a_i \le b_i$, $i = 1, \ldots, k$, and $\exists j$, $a_j < b_j$: if $B_{b_1 \ldots b_k}$ is a focal bucket for the $l$th transition and $a_l = 1$ then “unknown”, else $B_{a_1 \ldots a_k} < B_{b_1 \ldots b_k}$.

2. $a_i \ge b_i$, $i = 1, \ldots, k$, and $\exists j$, $a_j > b_j$: if $B_{a_1 \ldots a_k}$ is a focal bucket for the $l$th transition and $b_l = 1$ then “unknown”, else $B_{b_1 \ldots b_k} < B_{a_1 \ldots a_k}$.

3. There exist two bits $i \neq j$ such that $a_i < b_i$ and $a_j > b_j$: if $B_{b_1 \ldots b_k}$ is a focal bucket for the $l$th transition and $a_l = 1$, or $B_{a_1 \ldots a_k}$ is a focal bucket for the $l$th transition and $b_l = 1$, then “unknown”, else “NDP”.

4. $a_i = b_i$, $i = 1, \ldots, k$: if both buckets are focal buckets then “unknown”, else let the focal bucket be $B_{a_1 \ldots a_k}$, then $B_{a_1 \ldots a_k} < B_{b_1 \ldots b_k}$.
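The four cases of Algorithm 2 can be sketched in Python as follows. This is a minimal illustration, not the implementation used in the experiments: the function name, the encoding of tags as tuples of 0/1 bits, the encoding of focal information as sets of 0-based transition indices, and the return labels "A<B" and "B<A" are all assumptions made for the example.

```python
def extract_relation(a, b, a_focal=frozenset(), b_focal=frozenset()):
    """Relation between bucket A (tag a) and bucket B (tag b).

    a, b             : tuples of 0/1 bits, one bit per transition (assumed encoding).
    a_focal, b_focal : 0-based indices of the transitions for which the
                       bucket is focal (hypothetical representation).
    Returns "A<B", "B<A", "NDP" or "unknown".
    """
    if len(a) != len(b):
        raise ValueError("tags must have the same length")
    # The focal variable of A changed the marginals of B's variables,
    # so B's variables are descendants of it (and symmetrically for B).
    a_reaches_b = any(b[l] == 1 for l in a_focal)
    b_reaches_a = any(a[l] == 1 for l in b_focal)
    a_le_b = all(x <= y for x, y in zip(a, b))
    a_ge_b = all(x >= y for x, y in zip(a, b))

    if a_le_b and a_ge_b:                      # case 4: identical tags
        if a_focal and b_focal:
            return "unknown"
        if a_focal:
            return "A<B"
        if b_focal:
            return "B<A"
        # Assumption: two distinct buckets never share a tag unless one is focal.
        raise ValueError("identical tags and no focal bucket")
    if a_le_b:                                 # case 1
        return "unknown" if b_reaches_a else "A<B"
    if a_ge_b:                                 # case 2
        return "unknown" if a_reaches_b else "B<A"
    # case 3: the tags are incomparable
    return "unknown" if (a_reaches_b or b_reaches_a) else "NDP"
```

For instance, tags (1, 1, 0) and (1, 0, 1) are incomparable, so case 3 would normally give "NDP"; if the first bucket is focal for the first transition, the call returns "unknown", which is the conflict described in the text.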

Consider the binary relation “<” on the set of buckets as defined in Algorithm 2. We have the following theorem.

Theorem 7 The binary relation “<” on the set of buckets is a partial order.

Proof: The relation is transitive. If $B_{a_1 \ldots a_k} < B_{b_1 \ldots b_k}$ and $B_{b_1 \ldots b_k} < B_{c_1 \ldots c_k}$, we have $a_i \le b_i \le c_i$, $i = 1, \ldots, k$.

1. $\exists j$, $a_j < c_j$. If $B_{c_1 \ldots c_k}$ is not a focal bucket, then we have $B_{a_1 \ldots a_k} < B_{c_1 \ldots c_k}$. If $B_{c_1 \ldots c_k}$ is a focal bucket for the $l$th transition and $a_l = 1$, then $b_l = 1$ since $a_l \le b_l \le c_l$, which contradicts $B_{b_1 \ldots b_k} < B_{c_1 \ldots c_k}$.

2. $a_i = c_i$, $i = 1, \ldots, k$. Then $a_i = b_i = c_i$, $i = 1, \ldots, k$, and then $B_{a_1 \ldots a_k}$ has to be a focal bucket and $B_{b_1 \ldots b_k}$ is not one in order to have the relation $B_{a_1 \ldots a_k} < B_{b_1 \ldots b_k}$, which then contradicts $B_{b_1 \ldots b_k} < B_{c_1 \ldots c_k}$.

The relation is antisymmetric. If $B_{a_1 \ldots a_k} < B_{b_1 \ldots b_k}$ and $B_{b_1 \ldots b_k} < B_{a_1 \ldots a_k}$, then $a_i = b_i$, $i = 1, \ldots, k$. Since they cannot both be focal buckets, they must be the same bucket. □

Figure 3.2: (a) A causal graph; (b) The order graph for the TS $(P, P_X, P_Y)$; (c) The marked order graph.

A partially ordered set can be represented by a DAG. We construct a graph with both directed and undirected edges, called an order graph (OG), as follows: a node represents a bucket; for each pair of buckets $B$ and $B'$, there is a directed edge $B \to B'$ if $B < B'$, and an undirected edge $B - B'$ if the relation between them is “unknown”. If we had a perfect statistical test for distributional changes, an OG would be a DAG. For the causal graph shown in Figure 3.2(a) and the TS $(P, P_X, P_Y)$, the ideal OG is given in Figure 3.2(b).

In an OG, when $B$ is a focal bucket, a directed edge $B \to B'$ asserts that there exists a directed path from the focal variable contained in $B$ to all the variables in $B'$. Call a path a mixed directed path if it may contain undirected edges but contains no directed edges in the reverse direction. If there is no other mixed directed path from $B$ to $B'$ in the OG, there must be an edge from the focal variable in $B$ to at least one variable in $B'$ in the causal graph. We mark this type of edge as $B \xrightarrow{*} B'$, to distinguish it from edges that only represent potential edges in the causal graph. This information is useful when the child bucket $B'$ contains only one variable; we then assert that the edge $B \to B'$ must exist in the causal graph. We will call an OG with marked edges a marked order graph (MOG); an example is shown in Figure 3.2(c).

An algorithm for constructing a MOG is given in the following.

Algorithm 3 (Constructing MOG)

Input: an influential TS with known focal variables.

Output: a marked order graph.

1. Put variables into buckets using Algorithm 1.

2. Extract relations among buckets using Algorithm 2.

3. Let each bucket be a node.

4. For each pair of nodes $B$ and $B'$:

If $B < B'$, add an edge $B \to B'$.

If $B' < B$, add an edge $B' \to B$.

If the relation is “unknown”, add an edge $B - B'$.

5. For each focal bucket $B^f$ and each of its children $B$:

If there is no other mixed directed path from $B^f$ to $B$, mark the edge as $B^f \xrightarrow{*} B$.
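Steps 3 to 5 of Algorithm 3 can be sketched in Python as below. This is an illustrative sketch under stated assumptions: buckets are arbitrary hashable identifiers, `relation(x, y)` is a caller-supplied function (hypothetical) returning "A<B" when x < y, "B<A" when y < x, "unknown", or anything else for no-directed-path, and the "no other mixed directed path" test is implemented as a breadth-first search that ignores the direct edge.

```python
from collections import deque


def build_marked_order_graph(buckets, relation, focal_buckets):
    """Return (directed, undirected, marked) edge sets of a marked order graph."""
    directed, undirected = set(), set()
    for i, x in enumerate(buckets):
        for y in buckets[i + 1:]:
            rel = relation(x, y)
            if rel == "A<B":
                directed.add((x, y))
            elif rel == "B<A":
                directed.add((y, x))
            elif rel == "unknown":
                undirected.add(frozenset((x, y)))

    def successors(node):
        # Directed edges are followed forward, undirected edges either way.
        out = [t for (s, t) in directed if s == node]
        out += [next(iter(e - {node})) for e in undirected if node in e]
        return out

    def has_other_path(src, dst):
        # Search for a mixed directed path src ~> dst other than the edge src -> dst.
        seen, queue = {src}, deque([src])
        while queue:
            node = queue.popleft()
            for nxt in successors(node):
                if node == src and nxt == dst:
                    continue  # skip the direct edge itself
                if nxt == dst:
                    return True
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return False

    marked = {(s, t) for (s, t) in directed
              if s in focal_buckets and not has_other_path(s, t)}
    return directed, undirected, marked
```

With three single-variable buckets ordered X < Z < Y (so X < Y as well) and focal buckets X and Z, the edges X to Z and Z to Y are marked, while X to Y is not, because the path through Z is another mixed directed path.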

In summary, the information conveyed by a MOG is as follows:

1. An unmarked edge $B \to B'$: All variables in $B$ can be ordered before all variables in $B'$ in the causal graph; in other words, there are no directed paths from variables in $B'$ to variables in $B$. When $B$ is a focal bucket, there exists a directed path from the focal variable in $B$ to each variable in $B'$ in the causal graph.

2. A marked edge $B \xrightarrow{*} B'$: There exists a directed path from the focal variable in $B$ to each variable in $B'$. In the case that both $B$ and $B'$ contain one single variable, the edge $B \to B'$ must exist in the causal graph.

3. No edge between $B$ and $B'$: there is no directed path, in particular no edge, between any variable in $B$ and any variable in $B'$ in the causal graph.

3.4.3  Limitation  of  detecting  marginal  changes 

Can we fully recover a causal graph by detecting marginal distribution changes alone? To fully recover a causal graph, we must construct a MOG in which each bucket contains only one variable and every edge is marked. This may not, in general, be achieved. Consider a causal graph $G$ containing a path $X \to Z \to Y$: it is clear that we can never determine whether there is an edge $X \to Y$ in $G$, since all marginal changes produced by transitions would be the same after adding that edge. What is the best we can get then by detecting marginal changes?

Given a DAG $G$, if we remove an edge $X \to Y$ whenever there is another directed path from $X$ to $Y$, we get the transitive reduction of $G$. The transitive reduction of a DAG $G$ is the graph $G'$ with the fewest edges such that the transitive closure of $G'$ is equal to the transitive closure of $G$. The transitive closure of a DAG $G$ is the graph $G''$ such that an edge $X \to Y$ is in $G''$ iff there is a directed path from $X$ to $Y$ in $G$. By detecting marginal changes in TSs, the best we can hope to get is the transitive reduction of the actual causal graph. Since to mark an edge $X \to Y$, $X$ must be a focal variable, it follows that every node except leaf nodes must be a focal variable in order to mark every edge in the transitive reduction graph. To further make each bucket contain only one variable, every leaf node having the same set of parents as another leaf node must be a focal variable.

Figure 3.3: (a) A causal graph; (b) The order graph for the TS $(P, P_X, P_Y)$ without knowing the focal variables; (c) The marked order graph.

In  conclusion,  by  detecting  marginal  distribution  changes,  the  best  we  can 
learn  is  the  transitive  reduction  of  the  causal  graph,  and  we  can  achieve  it  by  a 
TS  in  which  every  variable  has  had  its  mechanism  changed. 
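The definitions of transitive closure and transitive reduction used above can be checked on the $X \to Z \to Y$ example with a short Python sketch. The function names and the encoding of a DAG as a set of (parent, child) pairs are assumptions made for this illustration; the reduction routine relies on the input being acyclic.

```python
def descendants(edges, src):
    """All nodes reachable from src by a directed path of length >= 1."""
    children = {}
    for u, v in edges:
        children.setdefault(u, set()).add(v)
    seen, stack = set(), [src]
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def transitive_closure(edges):
    """Edge X -> Y is in the closure iff there is a directed path from X to Y."""
    nodes = {n for e in edges for n in e}
    return {(u, v) for u in nodes for v in descendants(edges, u)}


def transitive_reduction(edges):
    """Drop every edge X -> Y for which another directed path from X to Y exists (DAG only)."""
    edges = set(edges)
    return {(u, v) for (u, v) in edges
            if v not in descendants(edges - {(u, v)}, u)}
```

Applied to the chain {X to Z, Z to Y} and to the same graph with the extra edge X to Y, both graphs have the same closure and the same reduction, which is why marginal changes alone cannot tell them apart.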

3.4.4  Unknown  focal  variables 

In  this  section  we  discuss  situations  where  we  know  that  a  mechanism  change  has 
occurred  at  a  single  variable  but  we  do  not  know  the  identity  of  that  variable. 

We  first  note  that,  without  knowing  the  focal  variables,  variables  can  still  be 
partitioned  into  buckets  using  Algorithm  1,  and  the  relations  between  pairs  of 
buckets  will  be  determined  by  rules  R1-R3  of  Section  3.4.2.  Second,  an  order 
graph  can  be  constructed  as  follows:  for  each  pair  of  buckets  B  and  B',  there 
is a directed edge $B \to B'$ if $B < B'$. For the causal graph of Figure 3.3(a) and the TS $(P, P_X, P_Y)$, the variables are partitioned into three buckets: $B_{10} = \{X, Q\}$, $B_{01} = \{Y\}$, $B_{11} = \{Z, W\}$, and the OG is shown in Figure 3.3(b).

Finally,  we  may  be  able  to  find  to  which  bucket  a  focal  variable  belongs  using 
the  following  theorem,  assuming  influentiality  and  perfect  statistical  tests.  (We 
still  call  such  a  bucket  a  “focal  bucket”,  because  it  behaves  as  a  focal  variable 
with  the  information  at  hand.) 

Theorem 8 Let $S_j$ be the set of buckets for which $a_j = 1$ in their tags $a_1 \ldots a_k$. Then the focal bucket $F^j$ for the $j$th transition is in $S_j$, and for any other bucket $B \in S_j$, $F^j < B$.

Proof: Let the focal variable $X$ for the $j$th transition be tagged as $a_1 \ldots a_k$; then $a_j = 1$, since $P(X)$ must change in this transition. All other variables in the set of buckets $S_j$ must be descendants of $X$, since all their marginals changed in the $j$th transition. Therefore, whenever $P(X)$ changes, their marginals must change too, that is, if $a_i = 1$ then $b_i = 1$ for any variable tagged as $b_1 \ldots b_k$ in $S_j$, which leads to $a_i \le b_i$, $i = 1, \ldots, k$. Hence for any bucket $B_{b_1 \ldots b_k} \in S_j$ not containing $X$, we have $B_{a_1 \ldots a_k} < B_{b_1 \ldots b_k}$. □

In practice, Theorem 8 may fail to identify a focal bucket when (due to imperfect statistical tests) there exists no bucket $F^j$ in $S_j$ satisfying $F^j < B$ for every other bucket $B \in S_j$. In the case that an identified focal bucket contains only one variable, we actually identify a focal variable. For the OG in Figure 3.3(b), the focal buckets for the first and second transitions can be found as $B_{10} = \{X, Q\}$ and $B_{01} = \{Y\}$ respectively, and we actually identify $Y$ as the focal variable of the second transition.

Finally  we  can  get  a  MOG  by  marking  edges  as  in  Algorithm  3.  For  our 
working  example,  the  ideal  MOG  is  shown  in  Figure  3.3(c). 
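Theorem 8 suggests a simple search for the focal bucket of each transition, sketched below in Python. The sketch assumes that buckets are identified by their tags (tuples of 0/1 bits, one bucket per distinct tag) and that, without focal information, $B_a < B_b$ holds exactly when $a_i \le b_i$ for all $i$ and the tags differ; the function names are hypothetical.

```python
def strictly_below(a, b):
    """Tag a precedes tag b: a_i <= b_i for every bit, and the tags differ."""
    return a != b and all(x <= y for x, y in zip(a, b))


def find_focal_bucket(tags, j):
    """Return the tag of the focal bucket for transition j (0-based), or None.

    tags : iterable of distinct bucket tags.
    The focal bucket is the member of S_j that precedes every other member.
    """
    s_j = [t for t in tags if t[j] == 1]
    for candidate in s_j:
        if all(strictly_below(candidate, other)
               for other in s_j if other != candidate):
            return candidate
    return None  # no bucket qualifies, e.g. because of test errors
```

For the three buckets of the working example, tagged (1, 0), (0, 1) and (1, 1), the search returns (1, 0) for the first transition and (0, 1) for the second, in agreement with the text.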

3.4.5  TSs  absent  of  influentiality 

If we allow for the possibility that a mechanism change at $X$ may not alter the marginal probabilities of some of $X$'s descendants, then detecting no change in $P(Y)$ provides no information on the causal relation between $X$ and $Y$. The only information we obtain is that a detected change in $P(Y)$ means that $Y$ is a descendant of the focal variable $X$. First we partition variables into tagged buckets using Algorithm 1. Then the relationship among buckets is determined as follows: let $B^i$ be the focal bucket for the $i$th transition; then $B^i < B_{a_1 \ldots a_k}$ if $a_i = 1$, where “<” represents that all variables in $B_{a_1 \ldots a_k}$ are descendants of the focal variable in $B^i$. Finally we compute the transitive closure of the “<” relation, denoted by $<^*$, to get more information. Simultaneous $B <^* B'$ and $B' <^* B$ would indicate change detection errors, and the relation between $B$ and $B'$ is then declared unknown. The information conveyed by $B <^* B'$ is that all variables in $B'$ are descendants of the focal variable in $B$ in the underlying causal graph.
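The procedure for TSs without influentiality can be sketched as follows. The data layout is an assumption made for the example: `tags` maps a bucket identifier to its tag (a tuple of 0/1 bits), and `focal[i]` is the identifier of the focal bucket of the $i$th transition (0-based). The closure is computed by naive iteration, which is adequate for a handful of buckets.

```python
def descendant_relation(tags, focal):
    """Return (closure, unknown) for the "<" relation without influentiality.

    closure : set of pairs (B, B') with B <* B'.
    unknown : set of frozensets {B, B'} for which both B <* B' and B' <* B
              were derived, signalling change-detection errors.
    """
    # Base relation: focal bucket of transition i precedes every bucket whose bit i is 1.
    closure = {(f, b)
               for i, f in enumerate(focal)
               for b, tag in tags.items()
               if b != f and tag[i] == 1}
    changed = True
    while changed:  # naive transitive closure
        changed = False
        for (x, y) in list(closure):
            for (y2, z) in list(closure):
                if y == y2 and x != z and (x, z) not in closure:
                    closure.add((x, z))
                    changed = True
    unknown = {frozenset(p) for p in closure if (p[1], p[0]) in closure}
    return closure, unknown
```

In the first test case below, bucket C does not change in the first transition, yet the closure still places the first focal bucket before it, because the second focal bucket changed in the first transition and C changed in the second.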

It is clear that if the identities of the focal variables are not given, we cannot get any order information from such a TS by detecting marginal changes.

3.4.6  Combining  static  and  dynamic  information 

So far, we discussed how to extract causal information given a TS by detecting distributional changes. In this section, we briefly describe how to combine this information with that obtained from independence tests.

Given data from a static stable distribution, we can recover (partially directed) causal graphs using conditional independence tests. Several such algorithms have been developed, including the IC algorithm [Pea00, Section 2.5] (initially introduced in [PV91]) and the PC algorithm [SGS93]. The output of these algorithms is a partially oriented graph representing an independence-equivalence class as defined by Theorem 3.

To  recover  a  causal  graph  from  a  TS,  we  first  extract  causal  information 
by  detecting  distribution  changes  as  described  in  Section  3.4,  then  run  the  IC 
algorithm  using  the  causal  information  as  prior  knowledge.  Note  that  since  a  TS 
is  composed  of  a  series  of  different  distributions,  we  need  to  test  independence 
relationships  across  all  distributions. 

We may obtain three types of causal information as shown in Section 3.4: causal order among certain variables, no edges between certain variables, and certain directed edges. The last two types (no-edge and determined-edge) can be incorporated directly. Causal order information can be used to restrict the search for candidate conditioning sets and thus reduce the complexity of the IC algorithm. Causal order information can also be used to orient more edges: any undirected edge $X - Y$ can be oriented as $X \to Y$ if $X$ is ahead of $Y$ in the causal order. These methods of incorporating background knowledge have been discussed in [SGS93, Section 5.4.5].

When the identities of all focal variables are known, after incorporating this causal information as background knowledge, the output of the IC algorithm would be a partially oriented graph representing the TS equivalence class as defined by Theorem 5. This is due to a theorem in [Mee95] which says that the orientation rules in the IC algorithm are complete with respect to any consistent background knowledge. If the identity of a focal variable is neither given nor identified as in Section 3.4.4, the edge directions between this focal variable and its neighbors may not be fixed, hence the output graph is not maximally oriented, and we have not obtained all the information implied by a TS. Algorithms for identifying focal variables are currently under investigation.

3.4.7  Experimental  results 

We use the $\chi^2$ test to detect distribution changes. Let $D^1$ and $D^2$ be two datasets, consisting of $N_1$ and $N_2$ cases respectively. Let $N_{1x}$ and $N_{2x}$ be the number of cases in $D^1$ and $D^2$ respectively in which a variable $X$ takes the value $x$. To test the hypothesis that $X$ has the same distribution in the two datasets, we compute the quantity

$$X^2 = N_1 N_2 \sum_x \frac{1}{N_{1x} + N_{2x}} \left( \frac{N_{1x}}{N_1} - \frac{N_{2x}}{N_2} \right)^2, \tag{3.8}$$

which asymptotically follows a $\chi^2$ distribution with $r_x - 1$ degrees of freedom, where $r_x$ is the number of states of $X$. Let the significance level be $\alpha$. If $X^2 > \chi^2_\alpha$ then we decide “change”, else we decide “no-change”.
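The statistic of Eq. (3.8) is straightforward to compute from the two tables of counts. The Python sketch below is an illustration only: the function names and the dictionary encoding of counts are assumptions, states that occur in neither dataset are skipped to avoid division by zero, and the critical value $\chi^2_\alpha$ is supplied by the caller rather than computed.

```python
def chi_square_change_statistic(counts1, counts2):
    """Two-sample statistic of Eq. (3.8).

    counts1[x], counts2[x] : number of cases with X = x in the first and
    second dataset (a missing key counts as zero).
    """
    n1, n2 = sum(counts1.values()), sum(counts2.values())
    if n1 == 0 or n2 == 0:
        raise ValueError("both datasets must be non-empty")
    total = 0.0
    for x in set(counts1) | set(counts2):
        n1x, n2x = counts1.get(x, 0), counts2.get(x, 0)
        if n1x + n2x == 0:
            continue  # state never observed: contributes nothing
        total += (n1x / n1 - n2x / n2) ** 2 / (n1x + n2x)
    return n1 * n2 * total


def detect_change(counts1, counts2, critical_value):
    """Decide "change" when the statistic exceeds the caller-supplied critical value."""
    stat = chi_square_change_statistic(counts1, counts2)
    return "change" if stat > critical_value else "no-change"
```

As a usage example, for a binary variable with counts 60/40 in one dataset and 40/60 in the other, the statistic is 8.0; with one degree of freedom and $\alpha = 0.05$ the critical value is 3.841, so the decision is "change".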


A mechanism change at a variable $V_i$ is simulated as follows. Consider the parameters in $\theta_{pa_i}$. If $\theta_{v_{i1}; pa_i} < 0.5$ then let $\theta'_{v_{i1}; pa_i} = \theta_{v_{i1}; pa_i} + \delta$, else let $\theta'_{v_{i1}; pa_i} = \theta_{v_{i1}; pa_i} - \delta$, where $\delta$ is a parameter for adjusting the change magnitude. The rest of the parameters in $\theta_{pa_i}$ are changed in proportion to their original values as

$$\theta'_{v_{ij}; pa_i} = \alpha\, \theta_{v_{ij}; pa_i}, \quad j = 2, \ldots, n, \qquad \text{where } \alpha = \frac{1 - \theta'_{v_{i1}; pa_i}}{1 - \theta_{v_{i1}; pa_i}}.$$

When we simulate a mechanism change at $V_i$, we change the parameters in $\theta_{pa_i}$ as above for each $pa_i \in Dm(Pa_i)$.
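The perturbation rule can be illustrated with a few lines of Python acting on one conditional distribution, that is, on the list of probabilities of the states of $V_i$ for a single parent configuration. The function name and the bound $0 < \delta \le 0.5$ (which keeps the result a valid distribution) are assumptions of this sketch.

```python
def perturb_distribution(theta, delta):
    """Simulate a mechanism change on one conditional distribution.

    theta : probabilities of the states of V_i for one parent configuration.
    delta : change magnitude, assumed to satisfy 0 < delta <= 0.5.
    The first probability is shifted by delta (up if below 0.5, down otherwise);
    the remaining ones are rescaled so that the total stays 1.
    """
    if not 0.0 < delta <= 0.5:
        raise ValueError("delta must lie in (0, 0.5]")
    first = theta[0]
    if first >= 1.0:
        raise ValueError("cannot rescale: the other states have zero mass")
    new_first = first + delta if first < 0.5 else first - delta
    scale = (1.0 - new_first) / (1.0 - first)
    return [new_first] + [scale * t for t in theta[1:]]
```

For example, with probabilities (0.2, 0.3, 0.5) and $\delta = 0.1$ the first entry becomes 0.3 and the others are scaled by 0.7/0.8, giving (0.3, 0.2625, 0.4375).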


In our experiments, we used data generated from a known network, the Alarm Bayesian network¹ [BSC89]. Samples used in the experiment were generated from the network using a demo version of the Netica API developed by Norsys Software Corporation. We used equal sample sizes for all datasets in a TS, that is, a sample size $N$ means that $N$ cases were generated for each dataset $D^i$ in $D_{TS} = \{D^0, \ldots, D^k\}$.


3.4.7.1 Errors in detecting changes

There are two types of errors in detecting changes: (i) mistaking “no-change” for “change”, known as type I error and denoted NC2C, and (ii) mistaking “change” for “no-change”, known as type II error and denoted C2NC. Let $G$ be the causal graph used for generating samples. When a mechanism change occurs at a variable $V_i$, if our test statistic is perfect, all of $V_i$'s descendants in $G$ should be identified as “change” and $V_i$'s nondescendants as “no-change”. Let $Dec_i$ be the number of descendants of $V_i$ in $G$ and $NDec_i$ be the number of nondescendants of $V_i$. Let $c2nc_i$ be the number of descendants of $V_i$ identified as “no-change” by the $\chi^2$ test, and let $nc2c_i$ be the number of nondescendants of $V_i$ identified as “change”. $nc2c_i$ and $c2nc_i$ represent the numbers of type I and type II mistakes made by the $\chi^2$ statistic. In any one run, we simulate a mechanism change at each node $V_i$, $i = 1, \ldots, n$, relative to the original network, and compute the C2NC error rate as $\sum_i c2nc_i / \sum_i Dec_i$ and the NC2C error rate as $\sum_i nc2c_i / \sum_i NDec_i$. We computed an average error rate over 5 runs.
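The two pooled error rates can be computed as in the following Python sketch. The data layout (dictionaries keyed by the variable whose mechanism was changed) and the function name are hypothetical, and the sketch assumes that the changed variable itself is counted neither as a descendant nor as a nondescendant, a detail the text does not specify.

```python
def change_detection_error_rates(variables, descendants, detected):
    """Return (C2NC rate, NC2C rate) pooled over all simulated mechanism changes.

    descendants[v] : true descendants of v in the generating graph.
    detected[v]    : variables flagged "change" when the mechanism of v changes.
    """
    c2nc = nc2c = n_dec = n_ndec = 0
    for v in variables:
        dec = set(descendants[v])
        ndec = set(variables) - dec - {v}   # assumption: v itself is excluded
        flagged = set(detected[v])
        c2nc += len(dec - flagged)          # descendants missed (type II)
        nc2c += len(ndec & flagged)         # nondescendants flagged (type I)
        n_dec += len(dec)
        n_ndec += len(ndec)
    if n_dec == 0 or n_ndec == 0:
        raise ValueError("error rates undefined without descendants or nondescendants")
    return c2nc / n_dec, nc2c / n_ndec
```

On a three-variable chain X to Z to Y with one missed descendant and one falsely flagged nondescendant, both rates come out as 1/3.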

¹ We used the version downloaded from the web site of Norsys Software Corporation, http://www.norsys.com.

Figure 3.4: Type I and Type II errors of the $\chi^2$ statistic.


We varied the change magnitude $\delta$, the sample size, and the significance level $\alpha$, and the results are shown in Figure 3.4. We see that the NC2C (type I) error rate is nearly the same as the $\alpha$ value for different change magnitudes and sample sizes, as expected. The C2NC (type II) error could be large when the $\alpha$ value is small or the change magnitude is small. This suggests that we should consider using a two-tailed $\chi^2$ test [SBM00] to control the C2NC error, especially when the sample size is not large. In a two-tailed $\chi^2$ test, we use another threshold $\alpha' > \alpha$ such that we decide “no-change” only when $X^2 < \chi^2_{\alpha'}$, but we have to decide “unknown” when $\chi^2_{\alpha'} < X^2 < \chi^2_{\alpha}$.
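The three-way decision of the two-tailed test can be written down directly. In this Python sketch the function name is hypothetical, both critical values are supplied by the caller, and a statistic that falls exactly on a threshold is reported as "unknown", a boundary convention the text leaves open.

```python
def three_way_decision(statistic, critical_alpha, critical_alpha_prime):
    """Two-threshold decision: "change", "no-change" or "unknown".

    critical_alpha       : critical value for the significance level alpha.
    critical_alpha_prime : critical value for alpha' > alpha, hence not larger
                           than critical_alpha.
    """
    if critical_alpha_prime > critical_alpha:
        raise ValueError("alpha' > alpha requires a smaller critical value")
    if statistic > critical_alpha:
        return "change"
    if statistic < critical_alpha_prime:
        return "no-change"
    return "unknown"
```

With one degree of freedom, $\alpha = 0.05$ and $\alpha' = 0.10$ correspond to the critical values 3.841 and 2.706, so a statistic of 3.0 would be reported as "unknown".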

3.4.7.2 Errors in order graphs

In an OG, an edge $B \to B'$ represents that all variables in $B$ can be causally ordered before the variables in $B'$. We call this type of information “order claims”. No edge between $B$ and $B'$ represents the absence of directed paths, in particular edges, between variables in $B$ and those in $B'$; this information will be called “no-directed-path (NDP) claims” and “no-edge claims” respectively. An edge $B - B'$ only signals mistakes in the statistical tests and will be called “unknown claims”. We performed the following experiments: for certain $\delta$, $\alpha$, sample size, and focal variables, we generate datasets, construct an OG, count the claims, and check against the true network to compute percentage errors for each type of claims.²

² Claims are counted between pairs of variables, not between pairs of buckets. Numbers vary with the focal variables picked, hence we did 100 runs, each time randomly picking a sequence of $k$ variables as focal variables, and computed average numbers.

Table 3.1: Errors in order graphs. $k$: the number of focal variables. $m$: the number of buckets. $E_o$: percentage error of order claims. $E_p$: percentage error of NDP claims. $E_e$: percentage error of no-edge claims. $u$: number of unknown claims. The columns headed # give the number of claims of each type.

N = 500
                            order claim        NDP claim
  k     δ      α      m      #      E_o        #      E_p      E_e      u
  5    0.1    0.01     8    275    0.13        37    0.3      0.049     0
  5    0.1    0.05    11    355    0.12        88    0.32     0.039     3
  5    0.5    0.01    10    379    0.03        84    0.31     0.027     1
  5    0.5    0.05    12    391    0.036      111    0.3      0.03      5
 10    0.1    0.01    15    354    0.13       137    0.3      0.044     1
 10    0.1    0.05    21    335    0.11       241    0.3      0.044    11
 10    0.5    0.01    18    360    0.02       206    0.3      0.027     5
 10    0.5    0.05    23    323    0.026      274    0.29     0.032    19

N = 5000
                            order claim        NDP claim
  k     δ      α      m      #      E_o        #      E_p      E_e      u
  5    0.1    0.01    10    369    0.044       80    0.3      0.025     2
  5    0.1    0.05    12    393    0.051      109    0.3      0.031     5
  5    0.5    0.01    10    400    0.014       78    0.19     0.015     2
  5    0.5    0.05    12    406    0.026      104    0.26     0.027     7
 10    0.1    0.01    19    364    0.027      207    0.28     0.02      6
 10    0.1    0.05    23    334    0.029      260    0.28     0.033    20
 10    0.5    0.01    19    377    0.0081     191    0.25     0.02      9
 10    0.5    0.05    23    334    0.018      265    0.26     0.03     22

The results are shown in Table 3.1 for various sample sizes $N$, numbers of focal variables $k$, mechanism change magnitudes $\delta$, and significance levels $\alpha$. From Table 3.1, we see that the NDP claims have a high percentage of error ($E_p$ between 0.19 and 0.32); however, if those claims are interpreted as representing no-edge only, then the error rates are much lower ($E_e$ below 0.05 throughout). As expected, the error rates are lower when $\delta$, the change magnitude, is larger, and a TS with more focal variables produces more no-edge claims.

3.5 Causal Discovery by the Bayesian Approach

3.5.1  The  Bayesian  approach 


Assume that we have a set of random samples $D$ generated from a causal model $M = \langle G, \Theta_G \rangle$. In the Bayesian approach, we compute the posterior probability of a causal graph $G$ given the dataset $D$ as:

$$P(G \mid D, \xi) = \frac{P(D \mid G, \xi)\, P(G \mid \xi)}{P(D \mid \xi)}, \tag{3.9}$$

where $\xi$ represents our background knowledge. The marginal likelihood of the data given $G$ is computed as

$$P(D \mid G, \xi) = \int P(D \mid \Theta_G, G, \xi)\, P(\Theta_G \mid G, \xi)\, d\Theta_G. \tag{3.10}$$

The term $P(D \mid \Theta_G, G, \xi)$ is the probability of the data given a Bayesian network and is computable. We need to provide prior distributions for the probability parameters, $P(\Theta_G \mid G, \xi)$, and for causal graphs, $P(G \mid \xi)$. The term $P(D \mid \xi)$ is just a normalizing constant.

We can then compute the posterior probability of any hypothesis of interest by averaging over all possible causal models. For example, the posterior probability that $X$ causes $Y$ is computed as

$$P(X \to Y \mid D, \xi) = \sum_{G:\; X \to Y \in G} P(G \mid D, \xi), \tag{3.11}$$

where the summation is over all causal graphs which contain the edge $X \to Y$. Since the number of possible graphs grows super-exponentially in the number of variables $n$, it is impractical to sum over all graphs except for very small $n$. One way to deal with this problem is to use the relative posterior probability $P(D, G \mid \xi)$ as a scoring metric and search for graphs with high scores.
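For a set of candidate graphs small enough to enumerate, Eq. (3.11) can be evaluated directly from the scores $\log P(D, G \mid \xi)$, since the normalizing constant cancels. The Python sketch below assumes that graphs are encoded as frozensets of (parent, child) pairs and that the log scores have already been computed by some scoring routine; the function name is hypothetical. Subtracting the largest log score before exponentiating avoids numerical underflow.

```python
import math


def edge_posterior(log_scores, edge):
    """Posterior probability that `edge` is present, over the listed graphs only.

    log_scores : dict mapping a graph (frozenset of (parent, child) pairs)
                 to log P(D, G | xi).
    edge       : a (parent, child) pair.
    """
    top = max(log_scores.values())
    total = sum(math.exp(s - top) for s in log_scores.values())
    with_edge = sum(math.exp(s - top)
                    for graph, s in log_scores.items() if edge in graph)
    return with_edge / total
```

If the candidate set is only a subset of all graphs (for example the high-scoring graphs found by search), the result is the posterior restricted to that subset, not the exact quantity of Eq. (3.11).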


3.5.2  Derivation  of  Bayesian  score 

For the case that the dataset $D$ is from a static distribution, closed-form expressions for $P(D \mid G, \xi)$ have been derived [CH92, Gei95]. We will extend previous derivations to incorporate dynamic data.

Assume that we have two datasets, $D$ and $D'$, generated from a causal graph $G$ but with different parameters, $\Theta_G$ and $\Theta'_G$ respectively. The marginal likelihood is computed as:

$$P(D, D' \mid G, \xi) = \int P(D, D' \mid \Theta_G, \Theta'_G, G, \xi)\, P(\Theta_G, \Theta'_G \mid G, \xi)\, d\Theta_G\, d\Theta'_G. \tag{3.12}$$

Assuming that data cases are random samples, and that the data are complete, that is, every variable is assigned a value in all data cases, we have

$$\begin{aligned}
P(D, D' \mid \Theta_G, \Theta'_G, G, \xi) &= P(D \mid \Theta_G, G, \xi)\, P(D' \mid \Theta'_G, G, \xi) \\
&= \prod_{l=1}^{N} P(C_l \mid \Theta_G, G, \xi) \prod_{l=1}^{N'} P(C'_l \mid \Theta'_G, G, \xi) \\
&= \prod_{i=1}^{n} \prod_{pa_i} \prod_{v_i} \theta_{v_i; pa_i}^{N_{v_i, pa_i}}\; {\theta'}_{v_i; pa_i}^{N'_{v_i, pa_i}},
\end{aligned} \tag{3.13}$$

where $N$ is the number of cases in $D$, $C_l$ represents a specific case in $D$, and $N_{v_i, pa_i}$ is the number of cases in $D$ for which $V_i$ takes the value $v_i$ and its parents $Pa_i$ take the value $pa_i$; $N'$, $C'_l$, and $N'_{v_i, pa_i}$ are the corresponding quantities for $D'$. We use $\prod_{v_i}$ as a shorthand for $\prod_{v_i \in Dm(V_i)}$ and $\prod_{pa_i}$ for $\prod_{pa_i \in Dm(Pa_i)}$.

Consider the prior distribution $P(\Theta_G, \Theta'_G \mid G, \xi)$. Assume that, as background knowledge, the two datasets $D$ and $D'$ are from a TP $(P, P')$ with known focal variable $V_l$. Therefore, the two sets of parameters $\Theta_G$ and $\Theta'_G$ differ only in the parameters in $\Psi_l$. With this knowledge, we assume the following prior:

$$P(\Theta_G, \Theta'_G \mid G, V_l, \xi) = P(\Theta_G \mid G, \xi)\, P(\Psi'_l \mid G, \xi) \prod_{i \neq l} \delta(\Psi'_i - \Psi_i), \tag{3.14}$$

where $\delta(x)$ is the Dirac delta function. Eq. (3.14) says that for $i \neq l$, $\Psi'_i = \Psi_i$, and the reader can verify that $P(\Theta_G, \Theta'_G \mid G, V_l, \xi)$ integrates to 1 and is a valid density function. We have put $V_l$ as a condition to reflect the fact that $V_l$ is known as the focal variable of the TP.

For the parameter priors $P(\Theta_G \mid G, \xi)$ and $P(\Psi'_l \mid G, \xi)$, we use the following assumptions given in [Gei95]:

•  Global Parameter Independence:

$$P(\Theta_G \mid G, \xi) = \prod_{i=1}^{n} P(\Psi_i \mid G, \xi) \tag{3.15}$$

•  Local Parameter Independence:

$$P(\Psi_i \mid G, \xi) = \prod_{pa_i} P(\theta_{pa_i} \mid G, \xi), \quad i = 1, \ldots, n. \tag{3.16}$$

•  Parameter Modularity: if $V_i$ has the same parents in two causal graphs $G_1$ and $G_2$, then

$$P(\theta_{pa_i} \mid G_1, \xi) = P(\theta_{pa_i} \mid G_2, \xi), \quad pa_i \in Dm(Pa_i). \tag{3.17}$$

While  these  assumptions  were  originally  made  for  learning  Bayesian  networks, 
[Hec95]  discussed  their  implications  for  causal  Bayesian  networks. 

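Using Eqs. (3.13)-(3.17), and integrating out $\Theta'_G \setminus \Psi'_l$, Eq. (3.12) is transformed to

$$\begin{aligned}
P(D_{TP} \mid G, V_l, \xi) = {} & \prod_{i \neq l} \prod_{pa_i} \int \Big( \prod_{v_i} \theta_{v_i; pa_i}^{M_{v_i, pa_i}} \Big) P(\theta_{pa_i} \mid \xi)\, d\theta_{pa_i} \\
& \times \prod_{pa_l} \int \Big( \prod_{v_l} \theta_{v_l; pa_l}^{N_{v_l, pa_l}} \Big) P(\theta_{pa_l} \mid \xi)\, d\theta_{pa_l} \\
& \times \prod_{pa_l} \int \Big( \prod_{v_l} {\theta'}_{v_l; pa_l}^{N'_{v_l, pa_l}} \Big) P(\theta'_{pa_l} \mid \xi)\, d\theta'_{pa_l},
\end{aligned} \tag{3.18}$$

where

$$M_{v_i, pa_i} = N_{v_i, pa_i} + N'_{v_i, pa_i}. \tag{3.19}$$

We use the notation $D_{TP} = \{D, D'\}$ and put $V_l$ as a condition to emphasize that Eq. (3.18) is obtained under the assumption that the datasets $D$ and $D'$ are from a TP with known focal variable $V_l$. The standard assumption for $P(\theta_{pa_i} \mid \xi)$ is a Dirichlet distribution:

$$P(\theta_{pa_i} \mid \xi) = Dir(\theta_{pa_i} \mid \alpha_{pa_i}), \tag{3.20}$$

where $\alpha_{pa_i} = \{\alpha_{v_i; pa_i} \mid v_i \in Dm(V_i)\}$ denotes the set of parameters for the Dirichlet distribution. Assuming that the set of parameters $\theta'_{pa_l}$ has the same prior distribution as $\theta_{pa_l}$ given by Eq. (3.20), we obtain

$$\begin{aligned}
P(D_{TP} \mid G, V_l, \xi) = {} & \prod_{i \neq l} \prod_{pa_i} \frac{\Gamma(\alpha_{pa_i})}{\Gamma(\alpha_{pa_i} + M_{pa_i})} \prod_{v_i} \frac{\Gamma(\alpha_{v_i; pa_i} + M_{v_i, pa_i})}{\Gamma(\alpha_{v_i; pa_i})} \\
& \times \prod_{pa_l} \frac{\Gamma(\alpha_{pa_l})}{\Gamma(\alpha_{pa_l} + N_{pa_l})} \prod_{v_l} \frac{\Gamma(\alpha_{v_l; pa_l} + N_{v_l, pa_l})}{\Gamma(\alpha_{v_l; pa_l})} \\
& \times \prod_{pa_l} \frac{\Gamma(\alpha_{pa_l})}{\Gamma(\alpha_{pa_l} + N'_{pa_l})} \prod_{v_l} \frac{\Gamma(\alpha_{v_l; pa_l} + N'_{v_l, pa_l})}{\Gamma(\alpha_{v_l; pa_l})},
\end{aligned} \tag{3.21}$$

where $\Gamma(\cdot)$ is the Gamma function, and

$$\alpha_{pa_i} = \sum_{v_i} \alpha_{v_i; pa_i}, \quad N_{pa_i} = \sum_{v_i} N_{v_i, pa_i}, \quad N'_{pa_i} = \sum_{v_i} N'_{v_i, pa_i}, \quad M_{pa_i} = \sum_{v_i} M_{v_i, pa_i}.$$

3.5.3 Likelihood equivalence

For two independence-equivalent causal graphs $G_1$ and $G_2$, any distribution compatible with $G_1$ is also compatible with $G_2$. Hence, it is reasonable to assume that a dataset $D$ from a static distribution cannot distinguish between independence-equivalent causal graphs, that is, $P(D \mid G_1, \xi) = P(D \mid G_2, \xi)$. [Gei95] call this assumption likelihood equivalence. They show that it constrains the space of prior parameters $\alpha_{v_i; pa_i}$ and call the resulting likelihood-equivalent Bayesian scoring metric the BDe metric. We will use prior parameters that satisfy the likelihood equivalence property, and call the associated metric $P(D_{TP}, G \mid V_l, \xi) = P(D_{TP} \mid G, V_l, \xi)\, P(G \mid \xi)$ the BDe_TP metric.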

The BDe_TP metric is not likelihood equivalent, and for a good reason. A TP can indeed distinguish independence-equivalent graphs: among those independence-equivalent graphs compatible with both $P$ and $P_{V_l}$, a TP $(P, P_{V_l})$ can distinguish those that can generate $P_{V_l}$ from $P$ with a single mechanism change from those that cannot.
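The marginal likelihood of Eq. (3.21) is a product of Dirichlet-multinomial terms and is most safely evaluated in log space. The following Python sketch is an illustration under stated assumptions, not the code used in this work: the nested-dictionary layout of the counts, the function names, and the choice of hyperparameters are all hypothetical, and the graph structure enters only through the way the counts are indexed by parent configuration. It uses `math.lgamma` from the standard library.

```python
from math import lgamma


def log_dirichlet_marginal(alphas, counts):
    """log[ Gamma(a)/Gamma(a+n) * prod_v Gamma(a_v+n_v)/Gamma(a_v) ],
    where a and n are the sums of the hyperparameters and of the counts."""
    a, n = sum(alphas), sum(counts)
    return (lgamma(a) - lgamma(a + n)
            + sum(lgamma(av + nv) - lgamma(av) for av, nv in zip(alphas, counts)))


def log_marginal_tp(counts, counts_prime, alphas, focal):
    """log P(D_TP | G, V_l, xi) for two complete datasets D and D'.

    counts[v][pa], counts_prime[v][pa] : lists of counts over the states of v,
        for parent configuration pa, in D and D' respectively.
    alphas[v][pa] : Dirichlet hyperparameters, one per state of v.
    focal         : the variable whose mechanism changed between D and D'.
    """
    total = 0.0
    for var, by_parent in counts.items():
        for pa, n in by_parent.items():
            n_prime = counts_prime[var][pa]
            a = alphas[var][pa]
            if var == focal:
                # Focal family: separate parameters for D and D'.
                total += log_dirichlet_marginal(a, n) + log_dirichlet_marginal(a, n_prime)
            else:
                # Unchanged family: pool the counts, M = N + N'.
                pooled = [x + y for x, y in zip(n, n_prime)]
                total += log_dirichlet_marginal(a, pooled)
    return total
```

As a check on a single binary variable without parents and a uniform prior, one case in each dataset with different values gives a marginal likelihood of 1/4 when the variable is focal and 1/6 when the counts are pooled.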

It is natural to extend the likelihood equivalence requirement and define a new property: a marginal likelihood $P(D \mid G, \xi)$ is said to be $V_l$-transition likelihood equivalent if for any dataset $D$ and two $V_l$-transition equivalent causal graphs $G_1$ and $G_2$, $P(D \mid G_1, \xi) = P(D \mid G_2, \xi)$.

Theorem 9 The marginal likelihood $P(D_{TP} \mid G, V_l, \xi)$ given by Eq. (3.21) is $V_l$-transition likelihood equivalent.


Proof: Eq. (3.21) can be rewritten as

$$\begin{aligned}
P(D_{TP} \mid G, V_l, \xi) = {} & \prod_{i} \prod_{pa_i} \frac{\Gamma(\alpha_{pa_i})}{\Gamma(\alpha_{pa_i} + M_{pa_i})} \prod_{v_i} \frac{\Gamma(\alpha_{v_i; pa_i} + M_{v_i, pa_i})}{\Gamma(\alpha_{v_i; pa_i})} \\
& \times \prod_{pa_l} \frac{\Gamma(\alpha_{pa_l})\, \Gamma(\alpha_{pa_l} + M_{pa_l})}{\Gamma(\alpha_{pa_l} + N_{pa_l})\, \Gamma(\alpha_{pa_l} + N'_{pa_l})} \prod_{v_l} \frac{\Gamma(\alpha_{v_l; pa_l} + N_{v_l, pa_l})\, \Gamma(\alpha_{v_l; pa_l} + N'_{v_l, pa_l})}{\Gamma(\alpha_{v_l; pa_l})\, \Gamma(\alpha_{v_l; pa_l} + M_{v_l, pa_l})}.
\end{aligned} \tag{3.22}$$

Let $G_1$ and $G_2$ be two $V_l$-transition-equivalent causal graphs. Then $G_1$ and $G_2$ are independence equivalent and have the same parent set $Pa_l$ by Theorem 4. The first term in Eq. (3.22) has exactly the same form as the BDe score and takes the same values for two independence-equivalent graphs [Gei95]. The second term obtains the same values for $G_1$ and $G_2$ since they have the same $Pa_l$ set. □

We see that given data from a TP, previously indistinguishable independence-equivalent causal graphs may now be distinguished, and in this sense, two datasets generated from the same causal structure but with different parameters give us more power to learn the structure. This power comes from our assumption (or knowledge) that only a single causal mechanism has changed in generating the two datasets. Indeed, if we have no knowledge of how the two sets of parameters $\Theta_G$ and $\Theta'_G$ differ, we may only assume that they are independent and have the same distributions:

$$P(\Theta_G, \Theta'_G \mid G, \xi) = P(\Theta_G \mid G, \xi)\, P(\Theta'_G \mid G, \xi), \tag{3.23}$$

which leads to a marginal likelihood given by

$$\begin{aligned}
P(D, D' \mid G, \xi) &= P(D \mid G, \xi)\, P(D' \mid G, \xi) \\
&= \prod_{i} \prod_{pa_i} \frac{\Gamma(\alpha_{pa_i})}{\Gamma(\alpha_{pa_i} + N_{pa_i})} \prod_{v_i} \frac{\Gamma(\alpha_{v_i; pa_i} + N_{v_i, pa_i})}{\Gamma(\alpha_{v_i; pa_i})} \\
&\quad \times \prod_{i} \prod_{pa_i} \frac{\Gamma(\alpha_{pa_i})}{\Gamma(\alpha_{pa_i} + N'_{pa_i})} \prod_{v_i} \frac{\Gamma(\alpha_{v_i; pa_i} + N'_{v_i, pa_i})}{\Gamma(\alpha_{v_i; pa_i})}.
\end{aligned} \tag{3.24}$$

Eq. (3.24) is a product of two BDe likelihoods applied to datasets $D$ and $D'$ respectively, and is still likelihood equivalent. Hence, without knowledge of how they came about, two datasets do not increase our power of discrimination, save for providing more samples.


3.5.4  Incorporating  experimental  data 


Now  assume  that  our  knowledge  is  that  the  cases  in  D'  are  from  an  experimental 
study  in  which  the  variable  \\  is  fixed  to  a  value  vij  E  Dm(Vi),  denoted  by 
do(Vi  =  vij)  or  do{vij).  Then  instead  of  the  Dirichlet  distribution,  we  assign  the 
following  prior  distribution  to  the  parameter  set  9'pai\ 

-  1)  II  SK;m)’  (3-25) 


which  asserts  that 


O' 

uvi  ;pai 


_  /  1  if  Vi  =  Vij 


0  otherwise 
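Plugging Eq. (3.25) into Eq. (3.18), we obtain

$$\begin{aligned}
P(\mathbb{D}_{TP} \mid G, do(v_{ij}), \xi) &= \prod_{pa_i} \frac{\Gamma(\alpha_{pa_i})}{\Gamma(\alpha_{pa_i} + N_{pa_i})} \prod_{v_i} \frac{\Gamma(\alpha_{v_i;pa_i} + N_{v_i;pa_i})}{\Gamma(\alpha_{v_i;pa_i})} \\
&\quad \times \prod_{l \neq i} \prod_{pa_l} \frac{\Gamma(\alpha_{pa_l})}{\Gamma(\alpha_{pa_l} + M_{pa_l})} \prod_{v_l} \frac{\Gamma(\alpha_{v_l;pa_l} + M_{v_l;pa_l})}{\Gamma(\alpha_{v_l;pa_l})}.
\end{aligned} \tag{3.26}$$

Eq. (3.26) has been given in [CY99]. Here we show that it can be derived by providing an informative parameter prior as given by Eqs. (3.14) and (3.25). In the derivation of Eq. (3.26), we have used the following equation

$$\int \delta(\theta'_{v_{ij};pa_i} - 1) \prod_{v_i \neq v_{ij}} \delta(\theta'_{v_i;pa_i}) \prod_{v_i} \left(\theta'_{v_i;pa_i}\right)^{N'_{v_i;pa_i}} d\theta'_{pa_i} = 1, \tag{3.27}$$

which follows from the fact that $N'_{v_i;pa_i} = 0$ for $v_i \neq v_{ij}$.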

Theorem 10 The likelihood $P(\mathbb{D}_{TP} \mid G, do(v_{ij}), \xi)$ given by Eq. (3.26) is $V_i$-transition likelihood equivalent.

Proof: The same as the proof of Theorem 9. □


3.5.5  Combining  various  types  of  dynamic  data 

So far we have only considered situations with two datasets. The discussion can be easily extended to situations with a sequence of datasets generated from a TS. Let $\mathbb{D} = \{D^0, D^1, \ldots, D^k\}$ be a sequence of datasets generated from some causal graph $G$ with parameters $\theta^0_G, \ldots, \theta^k_G$ respectively, and let $\Xi_G = \cup_{j=0}^{k} \theta^j_G$. The marginal likelihood is computed as

$$P(\mathbb{D} \mid G, \xi) = \int P(\mathbb{D} \mid \Xi_G, G, \xi)\, P(\Xi_G \mid G, \xi)\, d\Xi_G. \tag{3.28}$$

The term $P(\mathbb{D} \mid \Xi_G, G, \xi)$ can be computed as in Eq. (3.13). To give an appropriate parameter prior $P(\Xi_G \mid G, \xi)$, we need to know how the datasets in $\mathbb{D}$ came about. Assume that we have the knowledge that the sequence of datasets, which will now be denoted by $\mathbb{D}_{TS}$, is from a TS with a sequence of focal variables $F = (V_{i_1}, \ldots, V_{i_k})$. Then, we assume the following prior:

$$P(\Xi_G \mid G, F, \xi) = P(\theta^0_G \mid G, \xi) \Big( P(\Psi^1_{i_1} \mid G, \xi) \prod_{l \neq i_1} \delta(\Psi^1_l - \Psi^0_l) \Big) \cdots \Big( P(\Psi^k_{i_k} \mid G, \xi) \prod_{l \neq i_k} \delta(\Psi^k_l - \Psi^{k-1}_l) \Big), \tag{3.29}$$

where we have used the notation $\theta^j_G = \cup_{i=1}^{n} \Psi^j_i$, $j = 0, \ldots, k$, as before. Eq. (3.29) is an extension of Eq. (3.14), and says that the set of parameters $\theta^j_G$ differs from $\theta^{j-1}_G$ only in the parameters in $\Psi^j_{i_j}$. Let $I = \{i_1, \ldots, i_k\}$ be the set of indexes of the focal variables. Using the Dirichlet priors, we obtain the following expression for the marginal likelihood (3.28):

and $N^j_{v_i;pa_i}$ is the number of cases in the dataset $D^j$ for which $V_i$ takes the value $v_i$ and its parents $Pa_i$ take the value $pa_i$. Note that $M_{v_i;pa_i} = \sum_{j=0}^{k} N^j_{v_i;pa_i}$ is the number of cases in the whole dataset $\mathbb{D}_{TS}$ for which $V_i$ takes the value $v_i$ and its parents $Pa_i$ take the value $pa_i$. We will call the Bayesian scoring metric $P(\mathbb{D}_{TS}, G \mid F, \xi) = P(\mathbb{D}_{TS} \mid G, F, \xi)\, P(G \mid \xi)$ (with parameters $\alpha_{v_i;pa_i}$ satisfying the likelihood equivalence property) the BDe_TS metric.

A marginal likelihood $P(\mathbb{D} \mid G, \xi)$ is said to satisfy the property of $F$-transition likelihood equivalence if, for two $F$-transition equivalent causal graphs $G_1$ and $G_2$,

$$P(\mathbb{D} \mid G_1, \xi) = P(\mathbb{D} \mid G_2, \xi).$$

Theorem 11 The marginal likelihood $P(\mathbb{D}_{TS} \mid G, F, \xi)$ given by Eq. (3.30) is $F$-transition likelihood equivalent.


Proof:  Similar  to  the  proof  of  Theorem  9. 


□ 


Assume that a series of mechanism changes occurred to the same causal model $M = \langle G, \theta^0_G \rangle$, and let $F = (V_{i_1}, \ldots, V_{i_k})$ denote the sequence of focal variables and $P_{ES} = (P^0, P^1, \ldots, P^k)$ the corresponding sequence of distributions, where each pair $(P^0, P^j)$ is a TP with $V_{i_j}$ as the focal variable. We will call the pair $(P_{ES}, F)$ an experimental sequence (ES). An example of an ES is a series of experimental studies performed on a model. Now assume that we have the knowledge that the sequence of datasets, which will now be denoted by $\mathbb{D}_{ES}$, is from an ES with the focal variables $F = (V_{i_1}, \ldots, V_{i_k})$. We then assume the following prior:

$$P(\Xi_G \mid G, F, \xi) = P(\theta^0_G \mid G, \xi) \Big( P(\Psi^1_{i_1} \mid G, \xi) \prod_{l \neq i_1} \delta(\Psi^1_l - \Psi^0_l) \Big) \cdots \Big( P(\Psi^k_{i_k} \mid G, \xi) \prod_{l \neq i_k} \delta(\Psi^k_l - \Psi^0_l) \Big). \tag{3.31}$$


Eq. (3.31) is also an extension of Eq. (3.14), and says that the set of parameters $\theta^j_G$ differs from $\theta^0_G$ only in the parameters in $\Psi^j_{i_j}$. Using the Dirichlet distribution, the marginal likelihood is given by

A special case of ES is a series of experimental studies in which each variable in $F$ is fixed to some value respectively. Then we use the prior given in Eq. (3.25) for $P(\Psi^j_{i_j} \mid G, \xi)$, $j = 1, \ldots, k$, and we obtain

Eq. (3.34) has been given in [CY99].


Table 3.2: The posteriors of edges in the Cancer network.

δ = 0.1, B as the focal variable.

N      P(A→B|𝔻)          P(A→C|𝔻)          P(B→D|𝔻)          P(C→D|𝔻)          P(C→E|𝔻)
       BDe_TP   BDe      BDe_TP   BDe      BDe_TP   BDe      BDe_TP   BDe      BDe_TP   BDe
100    0.138    0.419    0.103    0.0394   0.997    0.87     0.853    0.86     0.552    0.441
200    0.335    0.482    0.354    0.136    1        0.993    0.983    0.993    0.607    0.403
500    0.604    0.686    0.43     0.457    1        0.999    0.996    1        0.713    0.728
1000   0.999    0.733    0.338    0.49     1        1        1        1        0.667    0.74
2000   1        0.75     0.336    0.5      1        1        1        1        0.666    0.75

δ = 0.5, B as the focal variable.

N      P(A→B|𝔻)          P(A→C|𝔻)          P(B→D|𝔻)          P(C→D|𝔻)          P(C→E|𝔻)
       BDe_TP   BDe      BDe_TP   BDe      BDe_TP   BDe      BDe_TP   BDe      BDe_TP   BDe
100    0.999    0.238    0.0325   0.0141   1        0.484    0.284    0.293    0.0733   0.239
200    1        0.289    0.212    0.0516   1        0.663    0.83     0.546    0.0476   0.0106
500    1        0.658    0.495    0.651    1        0.992    1        0.989    0.0476   0.00518
1000   1        0.726    0.342    0.547    1        1        1        1        0.645    0.538
2000   1        0.75     0.334    0.5      1        1        1        1        0.666    0.75

δ = 0.1, A as the focal variable.

N      P(A→B|𝔻)          P(A→C|𝔻)          P(B→D|𝔻)          P(C→D|𝔻)          P(C→E|𝔻)
       BDe_TP   BDe      BDe_TP   BDe      BDe_TP   BDe      BDe_TP   BDe      BDe_TP   BDe
100    0.832    0.471    0.226    0.106    0.979    0.911    0.958    0.84     0.477    0.441
200    0.827    0.494    0.278    0.0367   0.985    0.978    0.964    0.972    0.389    0.206
500    0.997    0.747    0.961    0.505    1        1        1        1        0.697    0.736
1000   0.995    0.75     0.948    0.5      1        1        1        1        0.961    0.75
2000   1        0.75     0.99     0.5      1        1        1        1        0.986    0.75

δ = 0.5, A as the focal variable.

N      P(A→B|𝔻)          P(A→C|𝔻)          P(B→D|𝔻)          P(C→D|𝔻)          P(C→E|𝔻)
       BDe_TP   BDe      BDe_TP   BDe      BDe_TP   BDe      BDe_TP   BDe      BDe_TP   BDe
100    1        0.586    0.832    0.57     0.999    0.916    0.961    0.878    0.0882   0.0171
200    1        0.676    0.992    0.642    1        0.999    1        0.999    0.47     0.113
500    1        0.746    1        0.507    1        1        1        1        0.963    0.739
1000   1        0.744    1        0.513    1        1        1        1        0.932    0.731
2000   1        0.75     1        0.5      1        1        1        1        0.994    0.75

Theorem 12 The marginal likelihoods $P(\mathbb{D}_{ES} \mid G, F, \xi)$ in (3.32) and $P(\mathbb{D}_{ES} \mid G, do(F), \xi)$ in (3.34) are $F$-transition likelihood equivalent.


Proof:  Similar  to  the  proof  of  Theorem  9.  □ 

In deriving Eqs. (3.30), (3.32), and (3.34), we have assumed that mechanism changes occurred at different variables. Situations in which different mechanism changes happen at the same variable can be easily incorporated. For example, in experimental studies, we may set a variable to different values. In this case, Eq. (3.34) is still applicable, while the quantity expressed in Eq. (3.33) should exclude all experimental data for which $V_{i_j}$ is set to some fixed value.

In summary, to compute the marginal likelihood for dynamic data, we just need to provide an appropriate prior $P(\Xi_G \mid G, \xi)$ that reflects our knowledge of how those data came about. We demonstrated this method with several priors given in Eqs. (3.14), (3.29), (3.31) and (3.25).


3.5.6 Experimental results


We tested the BDe_TP score with data generated from a known network, the Cancer Bayesian network.³ We assumed a uniform prior distribution over all possible network structures. We used the parameters $\alpha_{v_i;pa_i} = 1/(r_i q_i)$, where $r_i$ is the number of states of $V_i$ and $q_i$ is the number of states of $Pa_i$, which satisfies the likelihood-equivalence requirement [Gei95].


A mechanism change at a variable $V_i$ is simulated as follows. Consider the parameters in $\theta_{pa_i}$. If $\theta_{v_{i1};pa_i} < 0.5$ then let $\theta'_{v_{i1};pa_i} = \theta_{v_{i1};pa_i} + \delta$, else let $\theta'_{v_{i1};pa_i} = \theta_{v_{i1};pa_i} - \delta$, where $\delta$ is a parameter for adjusting the change magnitude. The rest of the parameters in $\theta_{pa_i}$ are changed in proportion to their original values: $\theta'_{v_{ij};pa_i} = a\,\theta_{v_{ij};pa_i}$, $j = 2, \ldots, r_i$, where $a = (1 - \theta'_{v_{i1};pa_i})/(1 - \theta_{v_{i1};pa_i})$. When we simulate a mechanism change at $V_i$, we change the parameters in $\theta_{pa_i}$ as above for each $pa_i \in Dm(Pa_i)$.
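The perturbation rule above can be sketched in a few lines. This is an illustrative sketch only: the function names `change_mechanism` and `change_cpt` and the dictionary layout of the conditional probability table are assumptions made here, and the rule is left undefined (an error is raised) when the first parameter equals 1, a case the text does not discuss.

```python
def change_mechanism(theta, delta):
    """Perturb one conditional distribution P(V_i | pa_i).

    theta : list [theta_{v_i1;pa_i}, ..., theta_{v_ir;pa_i}] summing to 1
    delta : change magnitude
    """
    first = theta[0]
    if first >= 1.0:
        raise ValueError("rescaling factor is undefined when the first parameter is 1")
    # Shift the first parameter up or down by delta.
    new_first = first + delta if first < 0.5 else first - delta
    # Rescale the remaining parameters in proportion to their original values.
    a = (1.0 - new_first) / (1.0 - first)
    return [new_first] + [a * t for t in theta[1:]]

def change_cpt(cpt, delta):
    """Apply the change for every parent configuration pa_i in Dm(Pa_i).

    cpt : dict mapping a parent configuration to its parameter list
    """
    return {pa: change_mechanism(theta, delta) for pa, theta in cpt.items()}
```

Because the remaining parameters are rescaled by $a$, each perturbed row still sums to 1.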


The Cancer network is shown in Figure 3.1(a). It has only 5 nodes, hence we can exhaustively go through all 29,281 possible structures to compute the Bayesian average of any hypothesis of interest and to find the graphs with the maximum posterior probabilities. We computed the probability of each edge in the true Cancer network as in Eq. (3.11), and compared the results given by the BDe_TP metric (3.21) with those given by the BDe metric (3.24). We experimented with $\delta$ values of 0.1 and 0.5, and focal variables $B$ and $A$ respectively, and generated a TP dataset $\mathbb{D}_{TP} = \{D^0, D^1\}$ for each case by first generating 2000 cases from the original network as $D^0$, then simulating a mechanism change, and finally generating another 2000 cases as $D^1$.
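The figure of 29,281 structures is the number of directed acyclic graphs on 5 labeled nodes. As a small check, the sketch below counts labeled DAGs with Robinson's recurrence, $a_n = \sum_{k=1}^{n} (-1)^{k+1} \binom{n}{k} 2^{k(n-k)} a_{n-k}$ with $a_0 = 1$; the recurrence is a standard result brought in here for illustration and is not part of the original text.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def count_dags(n):
    """Number of DAGs on n labeled nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )

print(count_dags(5))  # 29281
```

The count grows super-exponentially, which is why exhaustive Bayesian averaging is feasible for the 5-node Cancer network but not for much larger graphs.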


The results are shown in Table 3.2 for the first $N$ cases in the dataset ($N$ from $D^0$ and $N$ from $D^1$). When using the BDe metric, the Cancer network and its independence-equivalent graphs of Figure 3.1(b)-(d) obtain the maximum score when the sample size is large enough, and they obtain a much larger posterior than all other structures. $P(A \to B \mid \mathbb{D})$ goes to 0.75 because three of the four graphs of Figure 3.1(a)-(d) have the edge $A \to B$ and we assumed a uniform distribution over structures. For the same reason, with the BDe metric, $P(A \to C \mid \mathbb{D})$ goes to 1/2, $P(B \to D \mid \mathbb{D})$ and $P(C \to D \mid \mathbb{D})$ go to 1, and $P(C \to E \mid \mathbb{D})$ goes to 3/4. When using the BDe_TP metric with $B$ as the focal variable, the posterior over structures concentrates sharply around the three $B$-transition equivalent graphs of Figure 3.1(e)-(g) when the sample size is large. Hence, with increasing sample size, $P(A \to B \mid \mathbb{D})$ goes to 1, $P(A \to C \mid \mathbb{D})$ goes to 1/3, and $P(C \to E \mid \mathbb{D})$ goes to 2/3. With $A$ as the focal variable, the BDe_TP score concentrates sharply around the unique Cancer network (see Figure 3.1(h)) for large sample size, and the posteriors of all five edges go to 1.

³ We used the version downloaded from the web site of Norsys Software Corporation, http://www.norsys.com.
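The limiting values quoted above (3/4, 1/2, 1/3, 2/3) can be reproduced by brute-force enumeration. The sketch below makes two assumptions that are not stated in this passage: the edge list of the Cancer network is read off the column headers of Table 3.2, and, in line with the reported limits, knowing the focal variable is taken to fix the orientation of every edge incident to it. All function names are hypothetical.

```python
from itertools import product

# Edges of the Cancer network as read off Table 3.2 (assumption of this sketch).
TRUE_EDGES = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("C", "E")]

def v_structures(edges):
    """Colliders a -> c <- b whose parents a and b are non-adjacent."""
    adjacent = {frozenset(e) for e in edges}
    found = set()
    for a, c in edges:
        for b, c2 in edges:
            if c2 == c and a < b and frozenset((a, b)) not in adjacent:
                found.add((a, c, b))
    return found

def is_acyclic(edges):
    """Repeatedly remove nodes that have no incoming edge."""
    nodes = {x for e in edges for x in e}
    remaining = set(edges)
    while nodes:
        roots = {n for n in nodes if all(child != n for _, child in remaining)}
        if not roots:
            return False
        nodes -= roots
        remaining = {(p, c) for p, c in remaining if p not in roots}
    return True

def equivalence_class(true_edges):
    """Acyclic orientations of the skeleton with the same v-structures."""
    target = v_structures(true_edges)
    members = []
    for flips in product([False, True], repeat=len(true_edges)):
        edges = [(c, p) if f else (p, c) for (p, c), f in zip(true_edges, flips)]
        if is_acyclic(edges) and v_structures(edges) == target:
            members.append(frozenset(edges))
    return members

def edge_posteriors(graphs, true_edges):
    """Uniform average: fraction of graphs that contain each true edge."""
    return {e: sum(e in g for g in graphs) / len(graphs) for e in true_edges}

def restrict_to_focal(graphs, true_edges, focal):
    """Keep graphs that orient all edges incident to the focal variable correctly."""
    incident = [e for e in true_edges if focal in e]
    return [g for g in graphs if all(e in g for e in incident)]

members = equivalence_class(TRUE_EDGES)
print(len(members), edge_posteriors(members, TRUE_EDGES))
print(edge_posteriors(restrict_to_focal(members, TRUE_EDGES, "B"), TRUE_EDGES))
```

Under these assumptions the enumeration yields four independence-equivalent graphs, three graphs once $B$ is the focal variable, and a single graph once $A$ is the focal variable, matching the counts described in the text.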


3.6  Conclusion 

We proposed a new method of discovering causal structures, based on the detection of local, spontaneous changes in the underlying data-generating model. We analyzed the classes of structures that are equivalent relative to a stream of distributions produced by local changes, and devised algorithms that output graphical representations of these equivalence classes. We derived expressions for the Bayesian score that a causal structure should obtain from streams of data produced by locally changing distributions.

We have demonstrated, using simulated data, that the use of information about local changes may improve the power of discovery up to the theoretical limits set by statistical indistinguishability. The major advantage of the Bayesian treatment of local changes in Section 3.5, vis-a-vis the purely topological approach in Section 3.4, is that the Bayesian score is less sensitive to topological errors (e.g., remote descendants of focal variables that do not change). On the other hand, the Bayesian method is more computationally intensive; hybrid schemes remain to be investigated.

CHAPTER 4


Testable  Implications  of  Causal  Models 

4.1  Introduction 

It is known that the statistical information encoded in a causal model is completely captured by conditional independence relationships among the variables when all variables are observable [PGV90]. However, when a causal model invokes unobserved variables, or hidden variables, the network structure may impose equality and inequality constraints on the distribution of the observed variables, and those constraints may not be expressible as conditional independencies [SGS93, Pea95b]. [VP90] gave an example of non-independence equality constraints, shown in Figure 4.1(a), in which $U$ is unobserved.¹ A simple analysis shows that the quantity $\sum_b P(d \mid a, b, c) P(b \mid a)$ is not a function of $a$, i.e.,

$$\sum_b P(d \mid a, b, c)\, P(b \mid a) = f(c, d). \tag{4.1}$$


This  constraint  holds  even  though  no  restrictions  are  made  on  the  domains  of 
the  variables  involved  and  on  the  class  of  distributions  involved.  This  chapter 
develops  a  systematic  way  of  finding  such  functional  constraints. 

Finding non-independence constraints is useful both for empirically validating causal models and for distinguishing causal models with the same set of conditional independence relationships among the observed variables. For example, the two networks in Figure 4.1(a) and (b) encode the same set of independence statements ($A$ is independent of $C$ given $D$), but they are empirically distinguishable due to Verma's constraint (4.1). A structure-learning algorithm driven by conditional independence relationships would not be able to distinguish between the two models unless the constraint stated in Eq. (4.1) is tested and incorporated into the model-selection strategy.

Algebraic methods for finding equality and inequality constraints implied by Bayesian networks with hidden variables have been presented in [GM98, GM99]. Those methods assume a priori fixed domains and are limited to small networks with a small number of probabilistic parameters due to high computational demand. This chapter deals with conditional independence constraints and functional constraints, the type of constraints imposed by a network structure alone, regardless of the domains of the variables and the class of distributions. The conditional independence constraints can be read via the d-separation criterion [Pea88], but there is no general graphical criterion available for Verma-type functional constraints that are not captured by conditional independencies [RW97, Des99]. This chapter shows how the observed distribution factorizes according to the network structure, establishes relationships between this factorization and Verma-type constraints, and presents a procedure that systematically finds these constraints.

¹ We use dashed arrows for edges connected to hidden variables.

Figure 4.1: The network (a) imposes functional constraints; the network (b) encodes the same set of independence statements as (a) but does not impose functional constraints.

The chapter is organized as follows. Section 4.2 shows how functional constraints emerge in the presence of hidden variables. Section 4.3 shows how the observed distribution factorizes according to the network structure and introduces the concept of c-component, which plays a key role in identifying constraints. Section 4.4 presents a procedure for systematically identifying constraints. Section 4.5 shows that, for the purpose of finding constraints, instead of dealing with models with arbitrary hidden variables, we can work with a simplified model in which each hidden variable is a root node with two observed children. Section 4.6 concludes the chapter.

4.2  Functional  Constraints 

Letting $V = \{V_1, \ldots, V_n\}$ and $U = \{U_1, \ldots, U_{n'}\}$ stand for the sets of observed and hidden variables respectively, the observed probability distribution $P(v)$ is given by Eq. (1.3). Since all the factors of non-ancestors of $V$ can be summed out from Eq. (1.3), letting $U'$ be the set of variables in $U$ that are ancestors of $V$, Eq. (1.3) then becomes

$$P(v) = \sum_{u'} \prod_{V_i \in V} P(v_i \mid pa_{v_i}) \prod_{U_i \in U'} P(u_i \mid pa_{u_i}). \tag{4.2}$$

Therefore,  we  can  remove  from  the  network  G  all  the  hidden  variables  that  are 
not  ancestors  of  any  V  variables,  and  we  will  assume  that  each  U  variable  is  an 
ancestor  of  some  V  variable. 

To illustrate how functional constraints emerge from the factorization of (4.2), we analyze the example in Figure 4.1(a). For any set $S \subseteq V$, let $Q[S](v)$ denote the following function²

$$Q[S](v) = \sum_{u} \prod_{\{i \mid V_i \in S\}} P(v_i \mid pa_{v_i}) \prod_{\{i \mid U_i \in U\}} P(u_i \mid pa_{u_i}). \tag{4.3}$$

In particular, we have $Q[V](v) = P(v)$ and, for consistency, we set $Q[\emptyset](v) = 1$, since $\sum_u \prod_{\{i \mid U_i \in U\}} P(u_i \mid pa_{u_i}) = 1$. For convenience, we will often write $Q[S](v)$ as $Q[S]$. For Figure 4.1(a), Eq. (4.2) becomes

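$$P(a, b, c, d) = P(a)\, P(c \mid b)\, Q[\{B, D\}], \tag{4.4}$$

where

$$Q[\{B, D\}] = \sum_u P(b \mid a, u)\, P(d \mid c, u)\, P(u). \tag{4.5}$$

From (4.4), we obtain

$$Q[\{B, D\}] = \frac{P(a, b, c, d)}{P(a)\, P(c \mid b)} = P(d \mid a, b, c)\, P(b \mid a), \tag{4.6}$$

and from (4.5),

$$\begin{aligned}
Q[\{D\}] &= \sum_u P(d \mid c, u)\, P(u) && (4.7) \\
&= \sum_b Q[\{B, D\}] \\
&= \sum_b P(d \mid a, b, c)\, P(b \mid a). && (4.8)
\end{aligned}$$

Eq. (4.7) implies that $Q[\{D\}]$ is a function only of $c$ and $d$; therefore Eq. (4.8) induces a constraint that the quantity $\sum_b P(d \mid a, b, c) P(b \mid a)$ is independent of $a$.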
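The constraint can be checked numerically on a randomly parameterized instance of the model in Eq. (4.4)-(4.5). The sketch below assumes binary variables and the factorization $P(u) P(a) P(b \mid a, u) P(c \mid b) P(d \mid c, u)$; the helper names are hypothetical and the random parameters are only for illustration.

```python
import itertools
import random

def random_cpt(rng, n_parents):
    """Random binary CPT: parent configuration -> [P(0 | .), P(1 | .)]."""
    table = {}
    for cfg in itertools.product((0, 1), repeat=n_parents):
        p = rng.uniform(0.1, 0.9)
        table[cfg] = [p, 1.0 - p]
    return table

def joint_figure_4_1a(seed=0):
    """Observed joint P(a, b, c, d) with the hidden U summed out."""
    rng = random.Random(seed)
    pu, pa = random_cpt(rng, 0), random_cpt(rng, 0)
    pb = random_cpt(rng, 2)  # P(b | a, u)
    pc = random_cpt(rng, 1)  # P(c | b)
    pd = random_cpt(rng, 2)  # P(d | c, u)
    joint = {}
    for a, b, c, d in itertools.product((0, 1), repeat=4):
        joint[a, b, c, d] = sum(
            pu[()][u] * pa[()][a] * pb[(a, u)][b] * pc[(b,)][c] * pd[(c, u)][d]
            for u in (0, 1)
        )
    return joint

def marginal(joint, keep):
    """Marginalize the joint onto the variable positions listed in `keep`."""
    out = {}
    for cfg, p in joint.items():
        key = tuple(cfg[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

def verma_quantity(joint):
    """sum_b P(d | a, b, c) P(b | a), indexed by (a, c, d)."""
    p_abc, p_ab, p_a = marginal(joint, (0, 1, 2)), marginal(joint, (0, 1)), marginal(joint, (0,))
    out = {}
    for a, c, d in itertools.product((0, 1), repeat=3):
        out[a, c, d] = sum(
            joint[a, b, c, d] / p_abc[a, b, c] * p_ab[a, b] / p_a[(a,)]
            for b in (0, 1)
        )
    return out
```

For any seed, `verma_quantity(joint_figure_4_1a(seed))` should return the same value at $a = 0$ and $a = 1$ for each pair $(c, d)$, up to floating-point error, even though only observed quantities enter the computation.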

² $Q[S](v)$ can be interpreted as $Q[S](v) = P_{v \setminus s}(s)$.

Note that the key to obtaining this constraint rests with our ability to express $Q[\{B, D\}]$ and $Q[\{D\}]$ in terms of observed quantities (see (4.6) and (4.8)), namely quantities not involving $U$. Applying the same analysis to Figure 4.1(b), we have that $Q[\{D\}]$ gives the same expression as in Eq. (4.8), but now $Q[\{D\}] = \sum_u P(d \mid c, a, u) P(u)$ is also a function of $a$, and no Verma constraint is induced. In general, for any set $S \subseteq V$, $Q[S]$ in Eq. (4.3) is a function of the values of only a subset of $V$. Therefore, whenever $Q[S]$ is computable from the observational distribution $P(v)$, it may lead to some constraints: conditional independence relations or Verma-type functional constraints. In the rest of the chapter, we will show how to systematically find computable $Q[S]$, but first, we study what the arguments of $Q[S]$ are.

For any set $C$, let $G_C$ denote the subgraph of $G$ composed only of variables in $C$, let $An(C)$ denote the union of $C$ and the set of ancestors of the variables in $C$, and let $An^u(C) = An(C) \cap U$ denote the set of hidden variables in $An(C)$. In Eq. (4.3), the factors corresponding to the hidden variables that are not ancestors of $S$ in the subgraph $G_{S \cup U}$ can be summed out, and letting $U(S) = An^u(S)_{G_{S \cup U}}$ be the set of hidden variables that are ancestors of $S$ in the graph $G_{S \cup U}$, $Q[S]$ can be written as

$$Q[S] = \sum_{u(S)} \prod_{\{i \mid V_i \in S\}} P(v_i \mid pa_{v_i}) \prod_{\{i \mid U_i \in U(S)\}} P(u_i \mid pa_{u_i}). \tag{4.9}$$

We see that $Q[S]$ is a function of $S$, the observed parents of $S$, and the observed parents of $U(S)$. We will call an observed variable $V_i$ an effective parent of an observed variable $V_j$ if $V_i$ is a parent of $V_j$ or if there is a directed path from $V_i$ to $V_j$ in $G$ such that every internal node on the path is a hidden variable. For any set $S \subseteq V$, letting $Pa^+(S)$ denote the union of $S$ and the set of effective parents of the variables in $S$, we have that $Q[S]$ is a function of $Pa^+(S)$. Assuming that $Q[S]$ is a function of some set $T$, when $Q[S](t)$ is computable from $P(v)$, the expression obtained may be a function of the values of some set $T'$ larger than $T$ ($T \subset T'$). This leads to the constraint on the distribution $P(v)$ that the expression obtained for $Q[S]$ is independent of the values $t' \setminus t$, which could be a Verma-type functional constraint or a set of conditional independence statements.
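The definition of effective parents translates directly into a graph search. The following sketch computes $Pa^+(S)$ for a DAG given as a list of directed edges; the function name and the edge-list representation are assumptions made for illustration.

```python
def effective_parent_closure(targets, edges, hidden):
    """Return Pa+(S): S together with the effective parents of its members.

    targets : iterable of observed variables (the set S)
    edges   : iterable of (parent, child) pairs of the DAG G
    hidden  : set of hidden variables U
    """
    parents = {}
    for p, c in edges:
        parents.setdefault(c, set()).add(p)
    result = set(targets)
    stack = list(targets)
    visited_hidden = set()
    while stack:
        node = stack.pop()
        for p in parents.get(node, ()):
            if p in hidden:
                # Keep walking upward through hidden internal nodes.
                if p not in visited_hidden:
                    visited_hidden.add(p)
                    stack.append(p)
            else:
                # An observed node ends the path: it is an effective parent.
                result.add(p)
    return result

# Figure 4.1(a), with edges taken from the factorization in Eqs. (4.4)-(4.5).
edges_41a = [("A", "B"), ("B", "C"), ("C", "D"), ("U", "B"), ("U", "D")]
print(effective_parent_closure({"D"}, edges_41a, {"U"}))  # the set {C, D}
```

For Figure 4.1(a) this gives $Pa^+(\{D\}) = \{C, D\}$, in agreement with the observation that $Q[\{D\}]$ is a function only of $c$ and $d$.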

Next we give a lemma that will facilitate the computation of $Q[S]$ and the proof of other propositions. The lemma provides a condition under which we can compute $Q[W]$ from $Q[C]$, where $W$ is a subset of $C$, by simply summing $Q[C]$ over the remaining variables (in $C \setminus W$). For any set $C$, let $An^v(C) = An(C) \cap V$ be the set of observed variables in $An(C)$, and let $De^v(C)$ denote the set of observed variables that are in $C$ or are descendants of any variable in $C$. A set $A \subseteq V$ is called an ancestral set if it contains its own observed ancestors ($A = An^v(A)$), and a set $A \subseteq V$ is called a descendent set if it contains its own observed descendants ($A = De^v(A)$). Letting $G(C) = G_{C \cup U(C)}$ denote the subgraph of $G$ composed only of variables in $C$ and $U(C)$, which corresponds to the quantity $Q[C]$ (see Eq. (4.9)), we have the following lemma.

Lemma 2 Let $W \subseteq C \subseteq V$, and $W' = C \setminus W$. If $W$ is an ancestral set in $G(C)$ ($W = An^v(W)_{G(C)}$), or equivalently, if $W'$ is a descendent set in $G(C)$ ($W' = De^v(W')_{G(C)}$), then

$$\sum_{w'} Q[C] = Q[W]. \tag{4.10}$$


Proof sketch: By Eq. (4.9),

$$\sum_{w'} Q[C] = \sum_{w'} \sum_{u(C)} \prod_{V_i \in C} P(v_i \mid pa_{v_i}) \prod_{U_i \in U(C)} P(u_i \mid pa_{u_i}). \tag{4.11}$$

All factors in (4.11) corresponding to the variables (observed or hidden) that are not ancestors of $W$ in $G(C)$ are summed out, and we obtain

$$\sum_{w'} Q[C] = \sum_{An^u(W)_{G(C)}} \prod_{V_i \in W} P(v_i \mid pa_{v_i}) \prod_{U_i \in An^u(W)_{G(C)}} P(u_i \mid pa_{u_i}). \tag{4.12}$$

We have $An^u(W)_{G(C)} = An^u(W)_{G_{W \cup U}} = U(W)$ because $W$ is an ancestral set. Therefore the left hand side of (4.12) is equal to $Q[W]$ by Eq. (4.9). □

In the next section, we show how the distribution $P(v)$ decomposes according to the network structure and how the decomposition helps the computation of $Q[S]$.

4.3  C-components 

$P(v)$ as a summation of products in (4.2) may sometimes be decomposed into a product of summations. For example, in Figure 4.2, $P(v)$ can be written as

$$\begin{aligned}
P(v_1, v_2, v_3, v_4) &= \Big( \sum_{u_1} P(v_1 \mid u_1)\, P(v_3 \mid v_2, u_1)\, P(u_1) \Big) \Big( \sum_{u_2, u_3} P(v_2 \mid u_2, u_3)\, P(v_4 \mid v_3, u_2)\, P(u_2)\, P(u_3 \mid v_1) \Big) \\
&= Q[\{V_1, V_3\}]\, Q[\{V_2, V_4\}].
\end{aligned} \tag{4.13}$$

Figure 4.2: The graph is partitioned into c-components $\{V_1, V_3\}$ and $\{V_2, V_4\}$.

The importance of this decomposition lies in the fact that both terms $Q[\{V_1, V_3\}]$ and $Q[\{V_2, V_4\}]$ are computable from $P(v)$, as shown later. First we study graphical conditions under which this kind of decomposition is feasible.

Assume that $P(v)$ in Eq. (4.2) can be decomposed into a product of summations as:

$$P(v) = \prod_{V_i \in S^0} P(v_i \mid pa_{v_i}) \prod_j \Big( \sum_{n_j} \prod_{V_i \in S_j} P(v_i \mid pa_{v_i}) \prod_{U_i \in N_j} P(u_i \mid pa_{u_i}) \Big), \tag{4.14}$$

where the variables in $S^0$ have no hidden parents, $U$ is partitioned into $N_j$'s, and $V \setminus S^0$ is partitioned into $S_j$'s. $U_i$ and $U_j$ must be in the same set $N_k$ if (i) there is an edge between them ($U_i \to U_j$ or $U_i \leftarrow U_j$), or (ii) they have a common child ($U_i \to U_l \leftarrow U_j$ or $U_i \to V_l \leftarrow U_j$). Repeatedly applying these two rules, we obtain that $U_i$ and $U_j$ are in the same set $N_l$ if there exists a path between $U_i$ and $U_j$ in $G$ such that (i) every internal node of the path is in $U$, or (ii) every node in $V$ on the path is head-to-head ($\to V_l \leftarrow$). It is clear that this relation among the $U_i$'s is reflexive, symmetric, and transitive, and therefore it defines a partition of $U$. We construct $S_i$ as follows: a variable $V_k \in V$ is in $S_i$ if it has a hidden parent that is in $N_i$. The $S_i$'s form a partition of $V \setminus S^0$ since the $N_i$'s form a partition of $U$. Let each variable $V_i \in S^0$ form a set by itself, $S^0_i = \{V_i\}$. We have that the $S_i$'s and $S^0_i$'s form a partition of $V$. It is clear that if a hidden variable $U_k$ is not in $N_j$, then it does not appear in the factors of $\prod_{V_i \in S_j} P(v_i \mid pa_{v_i}) \prod_{U_i \in N_j} P(u_i \mid pa_{u_i})$, hence the decomposition of $P(v)$ in Eq. (4.14) follows. We will call each $S_i$ or $S^0_i$ a c-component (abbreviating "confounded component") of $V$ in $G$ or simply a c-component of $G$.
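The construction above amounts to a connected-components computation over the hidden variables. The sketch below implements it with a small union-find structure; the function name, the edge-list input, and the reading of Figure 4.2 (parent sets taken from the factors of Eq. (4.13)) are assumptions of this illustration rather than part of the original text.

```python
def c_components(observed, hidden, edges):
    """Partition the observed variables into c-components.

    observed : list of observed variable names (V)
    hidden   : list of hidden variable names (U)
    edges    : list of (parent, child) pairs of the DAG G
    """
    root = {u: u for u in hidden}

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]
            x = root[x]
        return x

    def union(x, y):
        root[find(x)] = find(y)

    hidden_parents = {}
    for p, c in edges:
        if p in root:
            if c in root:
                union(p, c)  # rule (i): an edge between two hidden variables
            hidden_parents.setdefault(c, []).append(p)
    for ps in hidden_parents.values():
        for other in ps[1:]:
            union(ps[0], other)  # rule (ii): two hidden variables share a child

    groups = {}
    for v in observed:
        ps = hidden_parents.get(v)
        # Variables without hidden parents form singleton c-components.
        key = find(ps[0]) if ps else ("singleton", v)
        groups.setdefault(key, set()).add(v)
    return sorted(groups.values(), key=sorted)

# Figure 4.1(a): the text gives c-components {A}, {C}, and {B, D}.
print(c_components(["A", "B", "C", "D"], ["U"],
                   [("A", "B"), ("B", "C"), ("C", "D"), ("U", "B"), ("U", "D")]))
```

A union-find structure is a convenient fit here because the two grouping rules are applied repeatedly until closure, which is exactly the transitive relation described in the text.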

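Assuming that $V$ is partitioned into c-components $S_1, \ldots, S_k$, Eq. (4.14) can be rewritten as

$$P(v) = Q[V] = \prod_i Q[S_i], \tag{4.15}$$

which follows from

$$\begin{aligned}
Q[S_j] &= \sum_u \prod_{\{i \mid V_i \in S_j\}} P(v_i \mid pa_{v_i}) \prod_{\{i \mid U_i \in U\}} P(u_i \mid pa_{u_i}) \\
&= \sum_{n_j} \prod_{V_i \in S_j} P(v_i \mid pa_{v_i}) \prod_{U_i \in N_j} P(u_i \mid pa_{u_i}) \sum_{u \setminus n_j} \prod_{U_i \in U \setminus N_j} P(u_i \mid pa_{u_i}) \\
&= \sum_{n_j} \prod_{V_i \in S_j} P(v_i \mid pa_{v_i}) \prod_{U_i \in N_j} P(u_i \mid pa_{u_i}),
\end{aligned} \tag{4.16}$$

where we have used the following formula

$$\sum_w \prod_{\{i \mid U_i \in W\}} P(u_i \mid pa_{u_i}) = 1, \quad \text{for any } W \subseteq U. \tag{4.17}$$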

We will call $Q[S_i]$ the c-factor corresponding to the c-component $S_i$. For example, Figure 4.1(a) is partitioned into c-components $\{A\}$, $\{C\}$, and $\{B, D\}$, with corresponding c-factors $Q[\{A\}] = P(a)$, $Q[\{C\}] = P(c \mid b)$, and $Q[\{B, D\}]$ in (4.5) respectively, and $P(v)$ can be written as a product of c-factors as in Eq. (4.4). In Figure 4.2, $V$ is partitioned into c-components $\{V_1, V_3\}$ and $\{V_2, V_4\}$, and $P(v)$ can be written as a product of c-factors $Q[\{V_1, V_3\}]$ and $Q[\{V_2, V_4\}]$ as in (4.13).

The importance of the c-factors stems from the fact that all c-factors are computable from $P(v)$. We generalize this result to proper subgraphs of $G$ and obtain the following lemma.

Lemma 3 Let $H \subseteq V$, and assume that $H$ is partitioned into c-components $H_1, \ldots, H_l$ in the subgraph $G(H) = G_{H \cup U(H)}$. Then we have

(i) $Q[H]$ decomposes as

$$Q[H] = \prod_i Q[H_i]. \tag{4.18}$$

(ii) Let $k$ be the number of variables in $H$, and let a topological order of the variables in $H$ be $V_{h_1} < \cdots < V_{h_k}$ in $G(H)$. Let $H^{(i)} = \{V_{h_1}, \ldots, V_{h_i}\}$ be the set of variables in $H$ ordered before $V_{h_i}$ (including $V_{h_i}$), $i = 1, \ldots, k$, and $H^{(0)} = \emptyset$. Then each $Q[H_j]$, $j = 1, \ldots, l$, is computable from $Q[H]$ and is given by

$$Q[H_j] = \prod_{\{i \mid V_{h_i} \in H_j\}} \frac{Q[H^{(i)}]}{Q[H^{(i-1)}]}, \tag{4.19}$$

where each $Q[H^{(i)}]$, $i = 0, 1, \ldots, k$, is given by

$$Q[H^{(i)}] = \sum_{h \setminus h^{(i)}} Q[H]. \tag{4.20}$$

(iii) Each $Q[H^{(i)}]/Q[H^{(i-1)}]$ is a function only of $Pa^+(T_i)$, where $T_i$ is the c-component of the subgraph $G(H^{(i)})$ that contains $V_{h_i}$.

Proof: (i) The decomposition of $Q[H]$ into Eq. (4.18) follows directly from the definition of c-component (see Eqs. (4.14)-(4.17)).

(ii)&(iii) Eq. (4.20) follows from Lemma 2 since each $H^{(i)}$ is an ancestral set. We prove (ii) and (iii) simultaneously by induction on $k$.

Base: $k = 1$. There is one c-component $Q[H_1] = Q[H] = Q[H^{(1)}]$, which satisfies Eq. (4.19) because $Q[\emptyset] = 1$, and $Q[H_1]$ is a function of $Pa^+(H_1)$.

Hypothesis: When there are $k$ variables in $H$, all $Q[H_i]$'s are computable from $Q[H]$ and are given by Eq. (4.19), and (iii) holds for $i$ from 1 to $k$.

Induction step: When there are $k + 1$ variables in $H$, assuming that the c-components of $G(H)$ are $H_1, \ldots, H_m, H'$, and that $V_{h_{k+1}} \in H'$, we have

$$Q[H] = Q[H^{(k+1)}] = Q[H'] \prod_i Q[H_i]. \tag{4.21}$$

Summing both sides of (4.21) over $V_{h_{k+1}}$ leads to

$$\sum_{v_{h_{k+1}}} Q[H] = Q[H^{(k)}] = \Big( \sum_{v_{h_{k+1}}} Q[H'] \Big) \prod_i Q[H_i], \tag{4.22}$$

where we have used Lemma 2. It is clear that each $H_i$, $i = 1, \ldots, m$, is a c-component of the subgraph $G(H^{(k)})$. Then by the induction hypothesis, each $Q[H_i]$, $i = 1, \ldots, m$, is computable from $Q[H^{(k)}] = \sum_{v_{h_{k+1}}} Q[H]$ and is given by Eq. (4.19), where each $Q[H^{(i)}]$, $i = 0, 1, \ldots, k$, is given by

$$Q[H^{(i)}] = \sum_{h^{(k)} \setminus h^{(i)}} Q[H^{(k)}] = \sum_{h \setminus h^{(i)}} Q[H]. \tag{4.23}$$

From Eq. (4.21), $Q[H']$ is computable as well, and is given by

$$Q[H'] = \frac{Q[H^{(k+1)}]}{\prod_i Q[H_i]} = \prod_{\{i \mid V_{h_i} \in H'\}} \frac{Q[H^{(i)}]}{Q[H^{(i-1)}]}, \tag{4.24}$$

which is clear from (4.19) and the chain decomposition $Q[H^{(k+1)}] = \prod_{i=1}^{k+1} Q[H^{(i)}]/Q[H^{(i-1)}]$.

By the induction hypothesis, (iii) holds for $i$ from 1 to $k$. Next we prove that it holds for $Q[H^{(k+1)}]/Q[H^{(k)}]$. The c-component of $G(H)$ that contains $V_{h_{k+1}}$ is $H'$. In Eq. (4.24), $Q[H']$ is a function of $Pa^+(H')$, and each term $Q[H^{(i)}]/Q[H^{(i-1)}]$, with $V_{h_i} \in H'$ and $V_{h_i} \neq V_{h_{k+1}}$, is a function of $Pa^+(T_i)$, where $T_i$ is a c-component of the graph $G(H^{(i)})$ that contains $V_{h_i}$ and therefore is a subset of $H'$. Hence we obtain that $Q[H^{(k+1)}]/Q[H^{(k)}]$ is a function only of $Pa^+(H')$. □

Proposition (iii) in Lemma 3 may imply a set of constraints on the distribution $P(v)$ whenever $Q[H]$ is computable from $P(v)$.

A special case of Lemma 3 is when $H = V$, and we obtain the following corollary.

Corollary 1 Assuming that $V$ is partitioned into c-components $S_1, \ldots, S_k$, we have

(i) $P(v) = \prod_i Q[S_i]$.

(ii) Let a topological order over $V$ be $V_1 < \cdots < V_n$, and let $V^{(i)} = \{V_1, \ldots, V_i\}$, $i = 1, \ldots, n$, and $V^{(0)} = \emptyset$. Then each $Q[S_j]$, $j = 1, \ldots, k$, is computable from $P(v)$ and is given by

$$Q[S_j] = \prod_{\{i \mid V_i \in S_j\}} P(v_i \mid v^{(i-1)}). \tag{4.25}$$

(iii) Each factor $P(v_i \mid v^{(i-1)})$ can be expressed as

$$P(v_i \mid v^{(i-1)}) = P(v_i \mid pa^+(T_i) \setminus \{v_i\}), \tag{4.26}$$

where $T_i$ is the c-component of $G(V^{(i)})$ that contains $V_i$.

We see that when hidden variables are invoked, a variable is independent of its non-descendants given its effective parents, the non-descendant variables in its c-component, and the effective parents of the non-descendant variables in its c-component. This is reminiscent of the property that each variable is independent of its non-descendants given its parents when there are no hidden variables.

4.4  Finding  Constraints 

With Lemmas 2 and 3 and Corollary 1, we can systematically find constraints
implied by a network structure. First we study a few examples.

4.4.1  Examples 

Consider Figure 4.2, which has two c-components {V_1, V_3} and {V_2, V_4}. The only
admissible order is V_1 < V_2 < V_3 < V_4. Applying Corollary 1, we obtain that the
two c-factors are given by

    Q[\{V_1, V_3\}](v_1, v_2, v_3) = P(v_3 | v_2, v_1) P(v_1),    (4.27)
Figure 4.3: Subgraphs for finding constraints. (a) G; (b) G({V_1, V_3, V_4}); (c) G({V_3, V_4}).


and 


    Q[\{V_2, V_4\}](v_1, v_2, v_3, v_4) = P(v_4 | v_3, v_2, v_1) P(v_2 | v_1).    (4.28)

They do not imply any constraints on the distribution. Summing both sides of
(4.28) over v_2, by Lemma 2, we obtain

    Q[\{V_4\}](v_3, v_4) = \sum_{v_2} P(v_4 | v_3, v_2, v_1) P(v_2 | v_1),    (4.29)

which implies a constraint on the distribution P(v) that the right hand side is
independent of v_1. Computing Q[{V_1}], Q[{V_2}], and Q[{V_3}] does not give any
constraints.
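The constraint read off from (4.29) can be checked numerically. The sketch below is only an illustration under stated assumptions: the hidden structure of Figure 4.2 is not reproduced here, so the code assumes a stand-in graph V1 -> V2 -> V3 -> V4 with one hidden binary variable `ua` pointing to V1 and V3 and another, `ub`, pointing to V2 and V4, which yields the same c-components {V1, V3} and {V2, V4}. All function names and the random parameters are hypothetical.

```python
import itertools
import random

def random_cpt(rng, n_parents):
    """Return {parent_values: P(child = 1 | parents)} for binary variables."""
    return {pv: rng.uniform(0.1, 0.9)
            for pv in itertools.product((0, 1), repeat=n_parents)}

def bern(p_one, value):
    """Probability that a binary variable with P(1) = p_one takes `value`."""
    return p_one if value == 1 else 1.0 - p_one

def observed_joint(seed=0):
    """Joint P(v1, v2, v3, v4) of the assumed graph, hidden variables summed out."""
    rng = random.Random(seed)
    p_ua, p_ub = rng.uniform(0.2, 0.8), rng.uniform(0.2, 0.8)
    f1 = random_cpt(rng, 1)   # V1 <- Ua
    f2 = random_cpt(rng, 2)   # V2 <- V1, Ub
    f3 = random_cpt(rng, 2)   # V3 <- V2, Ua
    f4 = random_cpt(rng, 2)   # V4 <- V3, Ub
    joint = {}
    for v1, v2, v3, v4 in itertools.product((0, 1), repeat=4):
        total = 0.0
        for ua, ub in itertools.product((0, 1), repeat=2):
            total += (bern(p_ua, ua) * bern(p_ub, ub)
                      * bern(f1[(ua,)], v1) * bern(f2[(v1, ub)], v2)
                      * bern(f3[(v2, ua)], v3) * bern(f4[(v3, ub)], v4))
        joint[(v1, v2, v3, v4)] = total
    return joint

def marginal(joint, keep):
    """Marginalize the joint onto the variable positions listed in `keep`."""
    out = {}
    for values, p in joint.items():
        key = tuple(values[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

def rhs_4_29(joint, v1, v3, v4):
    """sum over v2 of P(v4 | v3, v2, v1) P(v2 | v1), the right hand side of (4.29)."""
    p123 = marginal(joint, (0, 1, 2))
    p12 = marginal(joint, (0, 1))
    p1 = marginal(joint, (0,))
    total = 0.0
    for v2 in (0, 1):
        p_v4 = joint[(v1, v2, v3, v4)] / p123[(v1, v2, v3)]
        p_v2 = p12[(v1, v2)] / p1[(v1,)]
        total += p_v4 * p_v2
    return total

joint = observed_joint()
# Largest change in the right hand side of (4.29) when v1 is flipped.
max_gap = max(abs(rhs_4_29(joint, 0, v3, v4) - rhs_4_29(joint, 1, v3, v4))
              for v3 in (0, 1) for v4 in (0, 1))
print(max_gap)
```

Under the assumed graph the printed gap should be zero up to floating-point rounding, whereas a generic distribution over four binary variables would not satisfy this equality.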

Consider Figure 4.3(a), which has two c-components {V_2} and S = {V_1, V_3, V_4}.
The only admissible order is V_1 < V_2 < V_3 < V_4. Applying Corollary 1, we obtain

    Q[\{V_2\}](v_1, v_2) = P(v_2 | v_1),    (4.30)

    Q[S](v) = P(v_4 | v_3, v_2, v_1) P(v_3 | v_2, v_1) P(v_1).    (4.31)

In the subgraph G(S) = G_{S ∪ U} (Figure 4.3(b)), V_1 is not an ancestor of H =
{V_3, V_4}, and from Lemma 2, summing both sides of (4.31) over v_1, we obtain

    Q[H](v_2, v_3, v_4) = \sum_{v_1} P(v_4 | v_3, v_2, v_1) P(v_3 | v_2, v_1) P(v_1).    (4.32)

The subgraph G(H) = G_{H ∪ U} (Figure 4.3(c)) has two c-components {V_3} and
{V_4}. By Lemma 3, we have Q[H] = Q[{V_3}] Q[{V_4}], and

    Q[\{V_3\}](v_2, v_3) = \sum_{v_4} Q[H] = \sum_{v_1} P(v_3 | v_2, v_1) P(v_1),    (4.33)

    Q[\{V_4\}](v_3, v_4) = \frac{Q[H]}{\sum_{v_4} Q[H]}
        = \frac{\sum_{v_1} P(v_4 | v_3, v_2, v_1) P(v_3 | v_2, v_1) P(v_1)}{\sum_{v_1} P(v_3 | v_2, v_1) P(v_1)}.    (4.34)

Eq. (4.34) implies a constraint on P(v) that the right hand side is independent
of v_2.

From the preceding examples, we see that we may find constraints by alternately
applying Lemmas 2 and 3. Next, we present a procedure that systematically
looks for constraints.


4.4.2  Identifying  constraints  systematically 

Let a topological order over V be V_1 < ... < V_n, and let V^{(i)} = {V_1, ..., V_i},
i = 1, ..., n. For i from 1 to n, at each step, we will look for constraints that
involve V_i and the variables ordered before V_i. At step i, we do the following:

(A1) Consider the subgraph G(V^{(i)}). If G(V^{(i)}) has more than one c-component,
assuming that V_i is in the c-component S_i of G(V^{(i)}), then by Corollary 1,
Q[S_i] is computable from P(v) and may give a conditional independence
constraint that V_i is independent of its predecessors given its effective parents,
other variables in S_i, and the effective parents of other variables in S_i,
that is, V_i is independent of V^{(i)} \ Pa^+(S_i) given Pa^+(S_i) \ {V_i}.

(A2) Consider Q[S_i] in the subgraph G(S_i). For each descendent set D ⊂ S_i (D
contains its own observed descendants) in G(S_i) that does not contain V_i,^3
by Lemma 2 we have

    \sum_d Q[S_i] = Q[S_i \ D].    (4.35)

The left hand side of (4.35) is a function of Pa^+(S_i) \ D, while the right
hand side is a function of Pa^+(S_i \ D) ⊆ Pa^+(S_i) \ D. Therefore, if some
effective parents of D are not effective parents of S_i \ D, then (4.35) implies
a constraint on the distribution P(v) that the quantity \sum_d Q[S_i] is
independent of (Pa^+(S_i) \ D) \ Pa^+(S_i \ D).

Let D' = S_i \ D. Next we consider Q[D'] in the subgraph G(D'). If
G(D') has more than one c-component, assuming that V_i is in the c-component
E_i of G(D'), by Lemma 3, Q[E_i] is computable from Q[D'],
and Q[D'] / \sum_{v_i} Q[D'] is a function only of Pa^+(E_i), which imposes a
constraint on P(v) if Pa^+(D') \ Pa^+(E_i) ≠ ∅.

3 We need to consider every descendent set D that does not contain V_i, because it is possible
that for two descendent sets D_1 ⊂ D_2, the constraints from summing D_2 are not implied by
that from D_1, and vice versa.
Figure 4.4: A model imposing functional constraints. (a) G; (b) G({V_1, V_3, V_5}).


Finally we study Q[E_i] by repeating the process (A2) with S_i now replaced
by E_i.

The preceding analysis gives us a recursive procedure for systematically finding
constraints. To illustrate this process, we consider the example in Figure 4.4(a).
The only admissible order over V is V_1 < ... < V_5. The constraints
involving V_1 to V_4 are the same as in Figure 4.2, and here we look for constraints
involving V_5. V_5 is in the c-component S = {V_1, V_3, V_5}. By Corollary 1, Q[S] is
given by

    Q[S](v) = P(v_5 | v_4, v_3, v_2, v_1) P(v_3 | v_2, v_1) P(v_1),    (4.36)

which implies no constraints. In the subgraph G(S) (Figure 4.4(b)), the descendent
sets not containing V_5 are {V_1}, {V_3}, and {V_1, V_3}.
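Step (A2) starts by listing every descendent set of G(S_i) that omits V_i. A minimal sketch of that enumeration follows. It assumes the subgraph is supplied as a hypothetical `children` dictionary of directed edges among the observed nodes, and it uses brute-force subset enumeration, which is adequate only for small graphs. The example call assumes that G(S) in Figure 4.4(b) has no directed edges among V1, V3 and V5, which is what the three sets listed above suggest.

```python
from itertools import combinations

def descendants(children, node):
    """All nodes reachable from `node` along directed edges (excluding itself)."""
    seen, stack = set(), list(children.get(node, ()))
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(children.get(n, ()))
    return seen

def descendant_sets(nodes, children, exclude):
    """Non-empty subsets of `nodes` closed under descendants that omit `exclude`."""
    candidates = [n for n in nodes if n != exclude]
    result = []
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            chosen = set(subset)
            # A set qualifies when it contains every descendant of its members.
            if all(descendants(children, n) <= chosen for n in chosen):
                result.append(chosen)
    return result

# Assumed edge-free subgraph over S = {V1, V3, V5}; V5 is the variable under study.
sets_for_v5 = descendant_sets(["V1", "V3", "V5"], {}, "V5")
print(sets_for_v5)
```

A node that has the excluded variable among its descendants can never appear in a returned set, because closure would force the excluded variable in as well.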

(a) Summing both sides of (4.36) over v_1, we obtain

    Q[\{V_3, V_5\}](v_2, v_3, v_4, v_5) = \sum_{v_1} P(v_5 | v_4, v_3, v_2, v_1) P(v_3 | v_2, v_1) P(v_1),    (4.37)

which implies no constraints. The subgraph G({V_3, V_5}) is partitioned into two
c-components {V_3} and {V_5}, and by Lemma 3, we have

    Q[\{V_5\}](v_4, v_5) = \frac{Q[\{V_3, V_5\}]}{\sum_{v_5} Q[\{V_3, V_5\}]}
        = \frac{\sum_{v_1} P(v_5 | v_4, v_3, v_2, v_1) P(v_3 | v_2, v_1) P(v_1)}{\sum_{v_1} P(v_3 | v_2, v_1) P(v_1)},    (4.38)

which implies a constraint that the right hand side is independent of v_2 and v_3.

(b) Summing both sides of (4.36) over v_3, we obtain

    Q[\{V_1, V_5\}](v_1, v_4, v_5) = \sum_{v_3} P(v_5 | v_4, v_3, v_2, v_1) P(v_3 | v_2, v_1) P(v_1),    (4.39)

which implies a constraint that the right hand side is independent of v_2.
G({V_1, V_5}) cannot be further partitioned into c-components.

(c) Summing both sides of (4.36) over v_1 and v_3, we obtain

    Q[\{V_5\}](v_4, v_5) = \sum_{v_1, v_3} P(v_5 | v_4, v_3, v_2, v_1) P(v_3 | v_2, v_1) P(v_1),    (4.40)

which implies a constraint that the right hand side is independent of v_2. This
constraint is implied by that obtained from Eq. (4.38).

4.5  Projection  to  Semi-Markovian  Models 

If, in a causal model with hidden variables, each hidden variable is a root node
with exactly two observed children, then the corresponding model is a semi-Markovian
model. The examples we have studied in Figures 4.1, 4.3, and 4.4
are semi-Markovian models while Figure 4.2 is not. Semi-Markovian models are
easy to work with, and we will show that a causal model with arbitrary hidden
variables can be converted to a semi-Markovian model with exactly the same set
of constraints (that can be found through the procedure in Section 4.4.2) on the
observed distribution P(v).

In a semi-Markovian model, the observed distribution P(v) is given by Eq. (1.5),
and the function Q[S](v) in (4.3) becomes

    Q[S](v) = \sum_u \prod_{\{i | V_i \in S\}} P(v_i | pa_{v_i}) \prod_i P(u_i).    (4.41)

The presence of hidden variables is represented by bidirected edges in the
causal graph of a semi-Markovian model. It is easy to partition a graph with
bidirected edges into c-components. Let a path composed entirely of bidirected
edges be called a bidirected path. Two observed variables are in the same
c-component if and only if they are connected by a bidirected path. Letting Pa(S)
denote the union of S and the set of parents of S, it is clear that Q[S] is
a function of Pa(S). For semi-Markovian models, Lemmas 2 and 3 still hold, with
G(C) (G(H)) replaced by G_C (G_H), and Pa^+(·) replaced by Pa(·).
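The partition into c-components described above amounts to finding connected components of the graph formed by the bidirected edges alone. A minimal sketch, assuming the bidirected edges are supplied as pairs of node names (the function name and input format are hypothetical):

```python
def c_components(nodes, bidirected_edges):
    """Partition `nodes` into groups connected by paths of bidirected edges."""
    neighbours = {n: set() for n in nodes}
    for a, b in bidirected_edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    components, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        # Depth-first search restricted to bidirected edges.
        component, stack = set(), [start]
        while stack:
            n = stack.pop()
            if n not in component:
                component.add(n)
                stack.extend(neighbours[n] - component)
        seen |= component
        components.append(component)
    return components

# Bidirected edges V1 <-> V3 and V1 <-> V4 are an assumption chosen to match the
# c-components {V2} and {V1, V3, V4} stated for Figure 4.3(a).
print(c_components(["V1", "V2", "V3", "V4"], [("V1", "V3"), ("V1", "V4")]))
```

Directed edges play no role in the partition, so they are not passed to the function.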

A causal model with arbitrary hidden variables can be converted to a semi-Markovian
model by constructing its projection [Ver93].

Definition 5 (Projection) The projection of a DAG G over V ∪ U on the set
V, denoted by PJ(G, V), is a DAG over V with bidirected edges constructed as
follows:

1. Add each variable in V as a node of PJ(G, V).

2. For each pair of variables X, Y ∈ V, if there is an edge between them in G,
add the edge to PJ(G, V).

3. For each pair of variables X, Y ∈ V, if there exists a directed path from
X to Y in G such that every internal node on the path is in U, add edge
X → Y to PJ(G, V) (if it does not exist yet).

4. For each pair of variables X, Y ∈ V, if there exists a divergent path between
X and Y in G such that every internal node on the path is in U
(X ← ... ← U_i → ... → Y), add a bidirected edge X ↔ Y to PJ(G, V).
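The four construction steps can be sketched in code. The function below is an illustrative reading of Definition 5, not the construction given in [Ver93]: it assumes the DAG is supplied as a list of (parent, child) pairs, that every node not listed in `observed` is hidden, and that node names are strings. The names `project` and `observed_reach` are hypothetical.

```python
def project(edges, observed):
    """Return (directed, bidirected) edge sets of PJ(G, V) for a DAG of (parent, child) pairs."""
    observed = set(observed)
    children = {}
    for parent, child in edges:
        children.setdefault(parent, set()).add(child)

    def observed_reach(start):
        """Observed nodes reachable from `start` by directed paths with only hidden internal nodes."""
        found, visited = set(), set()
        stack = list(children.get(start, ()))
        while stack:
            n = stack.pop()
            if n in visited:
                continue
            visited.add(n)
            if n in observed:
                found.add(n)                      # an observed node ends the path
            else:
                stack.extend(children.get(n, ()))  # keep walking through hidden nodes
        return found

    # Steps 2 and 3: direct edges plus directed paths through hidden nodes.
    directed = {(x, y) for x in observed for y in observed_reach(x)}

    # Step 4: a hidden node that reaches two observed nodes through hidden-only
    # paths witnesses a divergent path between them.
    bidirected = set()
    all_nodes = set(children) | {c for cs in children.values() for c in cs}
    for h in all_nodes - observed:
        reached = sorted(observed_reach(h))
        for i, x in enumerate(reached):
            for y in reached[i + 1:]:
                bidirected.add((x, y))            # stored once, as a sorted pair
    return directed, bidirected

# Hypothetical DAG: U1 and W and H are hidden; X, Y, Z are observed.
example_edges = [("U1", "X"), ("U1", "W"), ("W", "Y"),
                 ("X", "Z"), ("Z", "H"), ("H", "Y")]
directed, bidirected = project(example_edges, {"X", "Y", "Z"})
print(sorted(directed), sorted(bidirected))
```

In this example the hidden chain Z -> H -> Y collapses to a directed edge Z -> Y, and the hidden common ancestor U1 of X and Y (reaching Y through W) yields the bidirected edge between X and Y.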

It is shown in [Ver93] that G and PJ(G, V) have the same set of conditional
independence relations among V. Next we show that the procedure presented in
Section 4.4.2 will find the same sets of constraints on P(v) in G and PJ(G, V).
To this purpose, we need to show that for any set H ⊆ V, G and PJ(G, V) have
the same arguments for Q[H], the same topological relations over H, and the
same sets of c-components.

Lemma 4 For any set H ⊆ V, Q[H] has the same arguments in G and PJ(G, V),
that is, Pa^+(H) in G is equal to Pa(H) in PJ(G, V).

Lemma 4 is obvious from Definition 5.

Lemma 5 For any set H ⊆ V, and any two variables V_i, V_j ∈ H, V_i is an
ancestor of V_j in G(H) if and only if V_i is an ancestor of V_j in PJ(G, V)_H (the
subgraph of PJ(G, V) composed only of variables in H).

Lemma 5 has been shown in [Ver93].

Lemma 6 For any set H ⊆ V, G(H) is partitioned into the same set of
c-components as PJ(G, V)_H.

Proof: (1) If two variables X, Y ∈ H are in the same c-component in PJ(G, V)_H,
then there is a bidirected path between X and Y in PJ(G, V)_H:

    X ↔ ... ↔ V_i ↔ ... ↔ Y

From the definition of a projection, there is a path between X and Y in G(H)
on which each observable is head-to-head:

    X ← U_l → V_j ← U_i ← ... → V_k ← U_m → Y

Therefore X and Y are in the same c-component in G(H).

(2) If X, Y ∈ H are in the same c-component in G(H), then there exist U_i
and U_j such that U_i is a parent of X, U_j is a parent of Y, and U_i = U_j or there
is a path p between U_i and U_j such that every observable on p is head-to-head
and every hidden variable on p is in U(H). We prove that X and Y are in the
same c-component in PJ(G, V)_H by induction on the number k of head-to-head
nodes on p.

Base: k = 0. There is no head-to-head node on p, so there is a divergent
path between X and Y in G:

    X ← ... ← U_k → ... → Y.

Therefore there is a bidirected edge X ↔ Y in PJ(G, V)_H, and X and Y
are in the same c-component in PJ(G, V)_H.

Induction hypothesis: If there are k head-to-head nodes on p, X and Y are
in the same c-component in PJ(G, V)_H.

If there are k + 1 head-to-head nodes on p, let W be the head-to-head node
closest to X on p. If W is an observable, let V_i = W, otherwise let V_i be an
observable descendant of W such that there is a directed path from W to V_i on
which all internal nodes are hidden variables. From the base case, X and V_i are
in the same c-component in PJ(G, V)_H, and from the induction hypothesis, V_i
and Y are in the same c-component in PJ(G, V)_H, hence we have that X and Y
are in the same c-component in PJ(G, V)_H. □

By Lemmas 4-6, we conclude that the procedure presented in Section 4.4.2 will
find the same sets of constraints on P(v) in G and PJ(G, V). Since it is easier
to work in a semi-Markovian model, we can always convert a Bayesian network
with arbitrary hidden variables to a semi-Markovian model before searching for
constraints on the distribution P(v).

4.6  Conclusion 

This chapter develops a systematic procedure for identifying functional constraints
induced  by  causal  Bayesian  networks  with  hidden  variables.  The  procedure  can 
be  used  for  devising  tests  for  validating  causal  models,  and  for  inferring  the 
structures  of  such  models  from  observed  data.  At  this  stage  of  research  we  cannot 
ascertain  whether  all  functional  constraints  can  be  identified  by  our  procedure; 
however,  we  could  not  rule  out  this  possibility. 
CHAPTER 5


Identification  of  Causal  Effects 

5.1  Introduction 

This chapter explores the feasibility of inferring cause-effect relationships from
various combinations of data and theoretical assumptions. The assumptions
considered will be represented in the form of an acyclic causal diagram which
contains both arrows and bi-directed arcs [Pea95a, Pea00]. The arrows represent
the potential existence of direct causal relationships between the corresponding
variables, and the bi-directed arcs represent spurious dependencies due to
unmeasured confounders. Our main task will be to decide whether the assumptions
represented in any given diagram are sufficient for assessing the strength of causal
effects from nonexperimental data and, if sufficiency is proven, to express the
target causal effect in terms of estimable quantities.

It is well known that, in the absence of unmeasured confounders, all causal
effects are identifiable, that is, the joint response of any set Y of variables to
intervention on a set T of treatment variables, P_t(y), can be estimated consistently
from nonexperimental data [Rob86, SGS93, Pea93]. If some confounders are not
measured, then the question of identifiability arises, and whether the desired
quantity can be estimated depends critically on the precise locations (in the
diagram) of those confounders vis-a-vis the sets T and Y. Sufficient graphical
conditions for ensuring the identification of P_t(y) were established by several
authors [SGS93, Pea93, Pea95a] and are summarized in [Pea00, Chapters 3 and
4]. For example, a criterion called "back-door" permits one to determine whether
a given causal effect P_t(y) can be obtained by "adjustment", that is, whether a
set C of covariates exists such that

    P_t(y) = \sum_c P(y | c, t) P(c)    (5.1)

When there exists no set of covariates that is sufficient for adjustment, causal
effects can sometimes be estimated by invoking multi-stage adjustments, through
a criterion called "front-door" [Pea95a]. More generally, identifiability can be
decided using do-calculus derivations [Pea95a], that is, a sequence of syntactic
transformations capable of reducing expressions of the type P_t(y) to subscript-free
expressions. Using do-calculus as a guide, [GP95] devised a graphical criterion for
identifying P_x(y) (where X and Y are singletons) that combines and expands the
"front-door" and "back-door" criteria (see [Pea00, pp. 114-8]).^1 [PR95] further
derived a graphical condition under which it is possible to identify P_t(y) where T
consists of an arbitrary set of variables. This permits one to predict the effect of
time varying treatments from longitudinal data, in the presence of unmeasured
confounders, some of which are affected by previous treatments. This criterion
was further extended by [Rob97b] and [KM99].

This chapter develops new graphical identification criteria that generalize and
simplify existing criteria in several ways. In Sections 5.2-5.5, we study the
identifiability problem in semi-Markovian models. Section 5.2 concerns the identification
of P_x(v), where X is a singleton and V is the set of all variables excluding X. It
asserts that P_x(v) is identifiable if and only if there is no consecutive sequence
of confounding arcs between X and X's children in the graph. When interest
lies in the effect of X on a subset S of outcome variables, not on the entire set
V, it is possible, however, that P_x(s) would be identifiable even though P_x(v)
is not. Section 5.3 first gives a sufficient criterion for identifying P_x(s), which
is an extension of the criterion for identifying P_x(v). It says that P_x(s) is
identifiable if there is no consecutive sequence of confounding arcs between X and
X's children in the subgraph composed of the ancestors of S. Other than this
requirement, the diagram may have an arbitrary structure, including any number
of confounding arcs between X and S. This simple criterion is shown to cover
all criteria reported in the literature (with X singleton), including the
"back-door", "front-door", and those developed by [GP95]. However, the criterion is
not necessary for identifying P_x(s). Section 5.3 further devises a procedure for
the identification and computation of P_x(s), based on systematic removal of
certain nonessential nodes from G. This procedure is shown to be more powerful
than the one devised by [GP95] ([Pea00, pp. 114-8]). Section 5.4 deals with the
identification of general causal effects, P_t(s), where T and S are arbitrary subsets
of variables, representing multiple interventions and multiple outcomes, such as
those encountered in the management of time varying treatments. The criterion
established in this section extends those of [PR95] and [KM99], and also provides
a criterion for the identification of direct effects, that is, the effect of one variable
on another when all other variables are held fixed (Section 5.4.4). Section 5.5
deals with the identification of general conditional causal effects P_t(s|c). Finally,
in Section 5.6, we show that causal effects in a Markovian model with arbitrary
sets of unobserved variables can be identified by first converting the model into
a semi-Markovian model while keeping the identifiability properties.

1  [GP95]  claimed  their  graphical  criterion  to  embrace  all  cases  where  identification  is  verifiable 
by  do-calculus.  We  show  in  this  chapter  (Section  5.3.7)  that  their  criterion  is  not  complete  in 
this  sense. 
5.2 Identification of P_x(v)


Let  X  be  a  singleton  variable.  In  this  section  we  study  the  problem  of  identifying 
the  causal  effects  of  X  on  V  \  {X},  (namely,  on  all  other  variables  in  V),  a 
quantity  denoted  by  Px(v). 


5.2.1  The  easiest  case 

Theorem 13 If there is no bidirected edge connected to X, then P_x(v) is
identifiable and is given by

    P_x(v) = P(v | x, pa_x) P(pa_x)    (5.2)

Proof: Since there is no bidirected edge connected to X, we have that the term
P(x | pa_x, u^x) = P(x | pa_x) in Eq. (1.5) can be moved ahead of the summation,
giving

    P(v) = P(x | pa_x) \sum_u \prod_{\{i | V_i \neq X\}} P(v_i | pa_i, u^i) P(u)
         = P(x | pa_x) P_x(v).    (5.3)

Hence,

    P_x(v) = P(v) / P(x | pa_x) = P(v | x, pa_x) P(pa_x)    (5.4)

□


Theorem 13 also follows from Theorem 3.2.5 of [Pea00], which states that
for any disjoint sets S and T in a Markovian model M, if the parents of T
are measured, then P_t(s) is identifiable. Indeed, when the parents of X are
measured, there would be no bidirected edge entering X in the semi-Markovian
representation of M and the identification of P_x(v) is ensured.

5.2.2  A  more  interesting  case 

The  case  where  there  is  no  bidirected  edge  connected  to  any  child  of  X  is  also 
easy to handle. As an example, consider the graph given in Figure 1.2. We have

    P(v) = P(z | x) \sum_u P(x | u) P(y | z, u) P(u),    (5.5)

    P_x(v) = P(z | x) \sum_u P(y | z, u) P(u).    (5.6)

From Eq. (5.5), we have

    \sum_u P(x | u) P(y | z, u) P(u) = P(v) / P(z | x),    (5.7)

hence,

    \sum_u P(y | z, u) P(u) = \sum_x \sum_u P(x | u) P(y | z, u) P(u) = \sum_x P(v) / P(z | x).    (5.8)

Substituting Eq. (5.8) into Eq. (5.6), we obtain

    P_x(y, z) = P(z | x) \sum_{x'} P(x', y, z) / P(z | x') = P(z | x) \sum_{x'} P(y | x', z) P(x').    (5.9)
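Eq. (5.9) can be sanity-checked on a concrete model. The sketch below builds a randomly parameterized binary model with the factorization of Eq. (5.5), computes P_x(y, z) directly from the full model as in Eq. (5.6), and compares it with the middle expression of Eq. (5.9) evaluated from the observational joint alone. The parameter values, the seed, and the function names are arbitrary choices made for illustration.

```python
import itertools
import random

def bern(p_one, value):
    """Probability that a binary variable with P(1) = p_one takes `value`."""
    return p_one if value == 1 else 1.0 - p_one

rng = random.Random(1)
p_u = rng.uniform(0.2, 0.8)                          # P(U = 1)
p_x = {u: rng.uniform(0.1, 0.9) for u in (0, 1)}     # P(X = 1 | u)
p_z = {x: rng.uniform(0.1, 0.9) for x in (0, 1)}     # P(Z = 1 | x)
p_y = {(z, u): rng.uniform(0.1, 0.9)
       for z in (0, 1) for u in (0, 1)}              # P(Y = 1 | z, u)

def joint(x, y, z):
    """Observational P(x, y, z), following Eq. (5.5)."""
    return bern(p_z[x], z) * sum(
        bern(p_x[u], x) * bern(p_y[(z, u)], y) * bern(p_u, u) for u in (0, 1))

def truth(x, y, z):
    """Interventional P_x(y, z) computed from the full model, Eq. (5.6)."""
    return bern(p_z[x], z) * sum(
        bern(p_y[(z, u)], y) * bern(p_u, u) for u in (0, 1))

def estimate(x, y, z):
    """P(z|x) * sum over x' of P(x', y, z) / P(z|x'), from the joint only, Eq. (5.9)."""
    def p_x_marg(xv):
        return sum(joint(xv, yv, zv) for yv in (0, 1) for zv in (0, 1))
    def p_z_given_x(zv, xv):
        return sum(joint(xv, yv, zv) for yv in (0, 1)) / p_x_marg(xv)
    return p_z_given_x(z, x) * sum(
        joint(xp, y, z) / p_z_given_x(z, xp) for xp in (0, 1))

max_gap = max(abs(truth(x, y, z) - estimate(x, y, z))
              for x, y, z in itertools.product((0, 1), repeat=3))
print(max_gap)
```

The two quantities should agree up to floating-point rounding for any strictly positive parameterization of this graph.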


This derivation can be generalized to the case where X has several children.
Letting Ch_x denote the set of X's children, we have the following theorem.

Theorem 14 If there is no bidirected edge connected to any child of X, then
P_x(v) is identifiable and is given by

    P_x(v) = \Big( \prod_{\{i | V_i \in Ch_x\}} P(v_i | pa_i) \Big) \sum_x \frac{P(v)}{\prod_{\{i | V_i \in Ch_x\}} P(v_i | pa_i)}    (5.10)


Proof: Let S = V \ (Ch_x ∪ {X}). Since there is no bidirected edge connected to
any child of X, the factors corresponding to the variables in Ch_x can be moved
ahead of the summation in Eqs. (1.5) and (1.6), and we have

    P(v) = \Big( \prod_{\{i | V_i \in Ch_x\}} P(v_i | pa_i) \Big) \sum_u P(x | pa_x, u^x) \prod_{\{i | V_i \in S\}} P(v_i | pa_i, u^i) P(u),    (5.11)

and

    P_x(v) = \Big( \prod_{\{i | V_i \in Ch_x\}} P(v_i | pa_i) \Big) \sum_u \prod_{\{i | V_i \in S\}} P(v_i | pa_i, u^i) P(u).    (5.12)

The variable X does not appear in the factors of \prod_{\{i | V_i \in S\}} P(v_i | pa_i, u^i), hence we
augment \prod_{\{i | V_i \in S\}} P(v_i | pa_i, u^i) with the term \sum_x P(x | pa_x, u^x) = 1, and write

    \sum_u \prod_{\{i | V_i \in S\}} P(v_i | pa_i, u^i) P(u)
        = \sum_x \sum_u P(x | pa_x, u^x) \prod_{\{i | V_i \in S\}} P(v_i | pa_i, u^i) P(u)
        = \sum_x \frac{P(v)}{\prod_{\{i | V_i \in Ch_x\}} P(v_i | pa_i)}.    (5.13)

Substituting this expression into Eq. (5.12) leads to Eq. (5.10). □

Figure 5.1: Theorem 14 is applicable for identifying P_x(z_1, z_2, z_3, y).

Figure 5.2: The problem of identifying P_x(z_1, z_2, y).

The usefulness of Theorem 14 can be demonstrated in the model of Figure 5.1.
Although the diagram is quite complicated, Theorem 14 is applicable, and readily
gives

    P_x(z_1, z_2, z_3, y) = P(z_1 | x, z_2) \sum_{x'} \frac{P(x', z_1, z_2, z_3, y)}{P(z_1 | x', z_2)}
                          = P(z_1 | x, z_2) \sum_{x'} P(y, z_3 | x', z_1, z_2) P(x', z_2).    (5.14)

Note that this expression remains valid when we add bidirected edges between
Z_3 and Y and between Z_3 and Z_2.

5.2.3  The  general  case 

When there are bidirected edges connected to the children of X, it may still be
possible to identify P_x(v). To illustrate, consider the graph in Figure 5.2, for
which we have

    P(v) = \sum_{u_1} P(x | u_1) P(z_2 | z_1, u_1) P(u_1) \sum_{u_2} P(z_1 | x, u_2) P(y | x, z_1, z_2, u_2) P(u_2),    (5.15)

and

    P_x(v) = \sum_{u_1} P(z_2 | z_1, u_1) P(u_1) \sum_{u_2} P(z_1 | x, u_2) P(y | x, z_1, z_2, u_2) P(u_2).    (5.16)

Let

    Q_1 = \sum_{u_1} P(x | u_1) P(z_2 | z_1, u_1) P(u_1),    (5.17)

and

    Q_2 = \sum_{u_2} P(z_1 | x, u_2) P(y | x, z_1, z_2, u_2) P(u_2).    (5.19)

Eq. (5.15) can then be written as

    P(v) = Q_1 \cdot Q_2,    (5.20)

and Eq. (5.16) as

    P_x(v) = Q_2 \sum_x Q_1.    (5.21)

Thus, if Q_1 and Q_2 can be computed from P(v), then P_x(v) is identifiable and
given by Eq. (5.21). In fact, it is enough to show that Q_1 can be computed from
P(v) (i.e., identifiable); Q_2 would then be given by P(v)/Q_1. To show that Q_1
can indeed be obtained from P(v), we sum both sides of Eq. (5.15) over y, and
get

    P(x, z_1, z_2) = Q_1 \cdot \sum_{u_2} P(z_1 | x, u_2) P(u_2).    (5.22)

Summing both sides of (5.22) over z_2, we get

    P(x, z_1) = P(x) \sum_{u_2} P(z_1 | x, u_2) P(u_2),    (5.23)

hence,

    \sum_{u_2} P(z_1 | x, u_2) P(u_2) = P(z_1 | x).    (5.24)
U2 
From Eqs. (5.24) and (5.22),

    Q_1 = P(x, z_1, z_2) / P(z_1 | x) = P(z_2 | x, z_1) P(x),    (5.25)

and from Eq. (5.20),

    Q_2 = P(v) / Q_1 = P(y | x, z_1, z_2) P(z_1 | x).    (5.26)

Finally, from Eq. (5.21), we obtain

    P_x(v) = P(y | x, z_1, z_2) P(z_1 | x) \sum_{x'} P(z_2 | x', z_1) P(x').    (5.27)

From the preceding example, we see that because the two bidirected arcs in
Figure 5.2 do not share a common node, the set of factors (of P(v)) containing
U_1 is disjoint from the set containing U_2, and P(v) can be decomposed into a product
of two terms, each being a summation of products. This decomposition has been
studied in Chapter 4, in which we introduced the ideas of "c-component" and
"c-factor".

5.2.3.1 C-component

Two variables are in the same c-component if and only if they are connected by
a bidirected path, a path composed entirely of bidirected edges. We will use the
Q[·] notation defined in Chapter 4. For any set C ⊆ V, Q[C](v) denotes the
following function (see Eq. (4.41))

    Q[C](v) = P_{v \setminus c}(c) = \sum_u \prod_{\{i | V_i \in C\}} P(v_i | pa_i, u^i) P(u).    (5.28)

For any set C, let G_C denote the subgraph of G composed only of variables in C.
We rewrite Corollary 1 as a lemma in the following, tailored for semi-Markovian
models.

Lemma 7 Assuming that V is partitioned into c-components S_1, ..., S_k, we have

(i) P(v) = \prod_i Q[S_i].

(ii) Let a topological order over V be V_1 < ... < V_n, and let V^{(i)} = {V_1, ..., V_i},
i = 1, ..., n, and V^{(0)} = ∅. Then each c-factor Q[S_j], j = 1, ..., k, is identifiable
and is given by

    Q[S_j] = \prod_{\{i | V_i \in S_j\}} P(v_i | v^{(i-1)}).    (5.29)

(iii) Each factor P(v_i | v^{(i-1)}) can be expressed as

    P(v_i | v^{(i-1)}) = P(v_i | pa(T_i) \ {v_i}),    (5.30)

where T_i is the c-component of G_{V^{(i)}} that contains V_i.

Figure 5.3: An example for applying Lemma 7.

We show the use of Lemma 7 by an example shown in Figure 5.3, which has
two c-components S_1 = {X_2, X_4} and S_2 = {X_1, X_3, Y}. P(v) decomposes into

    P(x_1, x_2, x_3, x_4, y) = Q[S_1] Q[S_2],    (5.31)

where

    Q[S_1] = \sum_{u_2} P(x_2 | x_1, u_2) P(x_4 | x_3, u_2) P(u_2),    (5.32)

    Q[S_2] = \sum_{u_1, u_3} P(x_1 | u_1) P(x_3 | x_2, u_1, u_3) P(y | x_4, u_3) P(u_1) P(u_3).    (5.33)

By Lemma 7, both Q[S_1] and Q[S_2] are identifiable. The only admissible order
of variables is X_1 < X_2 < X_3 < X_4 < Y, and Eq. (5.29) gives

    Q[S_1] = P(x_4 | x_1, x_2, x_3) P(x_2 | x_1),    (5.34)

    Q[S_2] = P(y | x_1, x_2, x_3, x_4) P(x_3 | x_1, x_2) P(x_1).    (5.35)

We can also check that the expressions obtained in Eqs. (5.25) and (5.26) for
Figure 5.2 satisfy Lemma 7.
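Eq. (5.29) is mechanical enough to generate symbolically: for each variable of a c-component, emit the conditional of that variable given all of its predecessors in the topological order. The helper below is a small illustrative sketch (the function name and the string format are hypothetical) that reproduces the factors appearing in (5.34) and (5.35).

```python
def c_factor(order, component):
    """Return the factors P(v_i | v^(i-1)) of Q[S_j] for the c-component `component`."""
    terms = []
    for i, v in enumerate(order):
        if v in component:
            predecessors = order[:i]          # all variables ordered before v
            given = ",".join(predecessors)
            terms.append(f"P({v}|{given})" if predecessors else f"P({v})")
    return terms

order = ["x1", "x2", "x3", "x4", "y"]
q_s1 = c_factor(order, {"x2", "x4"})          # factors of Q[S_1], cf. (5.34)
q_s2 = c_factor(order, {"x1", "x3", "y"})     # factors of Q[S_2], cf. (5.35)
print(" ".join(q_s1))
print(" ".join(q_s2))
```

The helper conditions on every predecessor; part (iii) of Lemma 7 would allow each conditioning set to be reduced further, which this sketch does not attempt.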

5.2.3.2 An identification criterion for P_x(v)

Lemma 7 has important implications for the general identifiability problem, and
in this section we show how to use this property to identify P_x(v).
Let X belong to the c-component S^X, and let the other c-components be S_1, ..., S_k.
We have

    P(v) = Q[S^X] \prod_i Q[S_i],    (5.36)

and

    P_x(v) = Q[S^X \setminus \{X\}] \prod_i Q[S_i].    (5.37)

Since all Q[S_i]'s are identifiable by Lemma 7, P_x(v) is identifiable if and only if
Q[S^X \ {X}] is identifiable, and we have the following theorem.


Theorem 15 If there is no bidirected path connecting X to any of its children,
then P_x(v) is identifiable and is given by

    P_x(v) = \frac{P(v)}{Q[S^X]} \sum_x Q[S^X]    (5.38)

where S^X is the c-component that contains X.

Proof: If there is no bidirected path connecting X to any of its children, then none
of X's children is in S^X. Under this condition, removing the term P(x | pa_x, u^x)
from Q[S^X] is equivalent to summing Q[S^X] over X, and we can write

    Q[S^X \setminus \{X\}] = \sum_x Q[S^X].    (5.39)

Hence from Eqs. (5.37) and (5.36), we obtain

    P_x(v) = \Big( \sum_x Q[S^X] \Big) \prod_i Q[S_i] = \Big( \sum_x Q[S^X] \Big) \frac{P(v)}{Q[S^X]},    (5.40)

which proves the identifiability of P_x(v). □


We demonstrate the use of Theorem 15 by identifying P_{x_1}(x_2, x_3, x_4, y) in
Figure 5.3. The graph has two c-components S_1 = {X_2, X_4} and S_2 = {X_1, X_3, Y},
with corresponding c-factors given in (5.34) and (5.35). Since X_1 is in S_2 and
its child X_2 is not in S_2, Theorem 15 ensures that P_{x_1}(x_2, x_3, x_4, y) is identifiable
and is given by

    P_{x_1}(x_2, x_3, x_4, y) = Q[S_1] \sum_{x_1} Q[S_2]
        = P(x_4 | x_1, x_2, x_3) P(x_2 | x_1) \sum_{x_1'} P(y | x_1', x_2, x_3, x_4) P(x_3 | x_1', x_2) P(x_1').    (5.41)

More examples where Theorem 15 is applicable can be found in Figure 3.8 of
[Pea00], some of which required complicated do-calculus derivations.
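The graphical condition of Theorem 15 is easy to test mechanically. The sketch below assumes a semi-Markovian graph supplied as two edge lists (directed edges as (parent, child) pairs and bidirected edges as unordered pairs); the function names are hypothetical. The edge lists used for Figure 5.3 are read off from the factors in Eqs. (5.32) and (5.33), not from the figure itself.

```python
def component_of(node, bidirected_edges):
    """Nodes connected to `node` by a path of bidirected edges (including `node`)."""
    component, stack = set(), [node]
    while stack:
        n = stack.pop()
        if n in component:
            continue
        component.add(n)
        for a, b in bidirected_edges:
            if a == n:
                stack.append(b)
            elif b == n:
                stack.append(a)
    return component

def satisfies_theorem_15(x, directed_edges, bidirected_edges):
    """True when no child of `x` is linked to `x` by a bidirected path."""
    children = {child for parent, child in directed_edges if parent == x}
    return not (children & component_of(x, bidirected_edges))

# Graph inferred from the factorization in (5.32)-(5.33).
fig_5_3_directed = [("X1", "X2"), ("X2", "X3"), ("X3", "X4"), ("X4", "Y")]
fig_5_3_bidirected = [("X1", "X3"), ("X2", "X4"), ("X3", "Y")]
print(satisfies_theorem_15("X1", fig_5_3_directed, fig_5_3_bidirected))

# A single confounded edge X -> Y with X <-> Y violates the condition.
print(satisfies_theorem_15("X", [("X", "Y")], [("X", "Y")]))
```

A `True` result means the sufficient condition of Theorem 15 holds, so P_x(v) can be written as in Eq. (5.38); a `False` result only says that this particular condition fails.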
5.2.3.3 Necessity of the criterion

Next  we  will  show  that  the  condition  given  in  Theorem  15  is  also  necessary  for 
the  identifiability  of  Px(v).  To  facilitate  the  proof  of  necessity,  first  we  prove  the 
following  lemma. 

Lemma 8 Let S, T ⊆ V be two disjoint sets of variables. If P_t(s) is not identifiable
in G, then P_t(s) is not identifiable in the graph resulting from adding a directed
or bidirected edge to G. Equivalently, if P_t(s) is identifiable in G, then P_t(s) is
still identifiable in the graph resulting from removing a directed or bidirected edge
from G.

Proof: If P_t(s) is not identifiable in G, then there exist two models with the same
causal graph G, M_1 and M_2, such that

    P^{M_1}(v) = P^{M_2}(v) > 0, and P_t^{M_1}(s) \neq P_t^{M_2}(s),    (5.42)

where

    P^{M_k}(v) = \sum_u \prod_i P^{M_k}(v_i | pa_i, u^i) P^{M_k}(u),  k = 1, 2.    (5.43)

For a graph G' with extra edges added to G, we can always construct new models
in such a way that the added edges are ineffective.

(i) Let G' be the graph identical to G except with an extra edge Y → V_j.
P(v) decomposes as

    P(v) = \sum_u P(v_j | pa_j, y, u^j) \prod_{i \neq j} P(v_i | pa_i, u^i) P(u).    (5.44)

We construct two models M_1' and M_2' with the causal graph G' as

    P^{M_k'}(v_i | pa_i, u^i) = P^{M_k}(v_i | pa_i, u^i),  i \neq j,  k = 1, 2,    (5.45)

    P^{M_k'}(v_j | pa_j, y, u^j) = P^{M_k}(v_j | pa_j, u^j),  k = 1, 2,    (5.46)

    P^{M_k'}(u) = P^{M_k}(u),  k = 1, 2.    (5.47)

Clearly, if the pair (M_1, M_2) satisfies (5.42), so would the pair (M_1', M_2'). Hence
P_t(s) is not identifiable in G'.


(ii) Let G' be the graph identical to G except with an extra edge V_j ↔ V_l.
P(v) decomposes as

    P(v) = \sum_{u'} P(u') \sum_u P(v_j | pa_j, u^j, u') P(v_l | pa_l, u^l, u') \prod_{i \neq j, l} P(v_i | pa_i, u^i) P(u),    (5.48)

where U' represents the new unobserved variable. We construct two models M_1'
and M_2' with the causal graph G' as

    P^{M_k'}(v_i | pa_i, u^i) = P^{M_k}(v_i | pa_i, u^i),  i \neq j, l,  k = 1, 2,    (5.49)

    P^{M_k'}(v_i | pa_i, u^i, u') = P^{M_k}(v_i | pa_i, u^i),  i = j, l,  k = 1, 2,    (5.50)

    P^{M_k'}(u) = P^{M_k}(u),  k = 1, 2.    (5.51)

Again, if the pair (M_1, M_2) satisfies (5.42), so would the pair (M_1', M_2'). Hence
P_t(s) is not identifiable in G'. □

Figure 5.4: A graph used in proving Theorem 16.


Next  we  prove  that  the  condition  given  in  Theorem  15  is  necessary. 

Theorem 16 If there is a bidirected path connecting $X$ to any of its children in $G$, then $P_x(v)$ is not identifiable.

Proof: Let $Y$ be a child of $X$ and assume that there is a bidirected path connecting $X$ and $Y$ with variables $Z_1, \ldots, Z_k$ on the path (see Figure 5.4). We will prove that, for any $k \geq 1$, $P_x(y, z_1, \ldots, z_k)$ is not identifiable in the graph shown in Figure 5.4, which is a subgraph of $G$. By Lemma 8, if $P_x(y, z_1, \ldots, z_k)$ is not identifiable in a subgraph of $G$, then it is not identifiable in $G$, and therefore $P_x(v)$ is not identifiable in $G$.

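Let $U = \{U_1, \ldots, U_{k+1}\}$. In Figure 5.4, we have

$$P(x, y, z_1, \ldots, z_k) = \sum_{u} P(x \mid u_1)\, P(y \mid x, u_{k+1})\, P(z_1 \mid u_1, u_2) \cdots P(z_k \mid u_k, u_{k+1})\, P(u_1) \cdots P(u_{k+1}), \tag{5.52}$$

and

$$P_x(y, z_1, \ldots, z_k) = \sum_{u} P(y \mid x, u_{k+1})\, P(z_1 \mid u_1, u_2) \cdots P(z_k \mid u_k, u_{k+1})\, P(u_1) \cdots P(u_{k+1}). \tag{5.53}$$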
Let all variables $X, Y, Z_1, \ldots, Z_k, U_1, \ldots, U_{k+1}$ be binary variables. We will prove the nonidentifiability of $P_x(y, z_1, \ldots, z_k)$ by constructing two models such that in both models,

$$P(x, y, z_1, \ldots, z_k) = (1/2)^{k+2}, \quad \text{for all possible values of } x, y, z_1, \ldots, z_k, \tag{5.54}$$

while $P_x(y, z_1, \ldots, z_k)$ has different values in the two models. The construction involves the specification of all conditional probabilities in a parametric form, and shows two different parameterizations both satisfying the set of $2^{k+2}$ equations in (5.54). We use the following parameterization, with five parameters, $a$, $b$, $c$, $d$, and $e$.


$$P(u_i) = 1/2, \quad u_i = 0, 1, \quad i = 1, \ldots, k+1 \tag{5.55}$$

x   u_1   P(x | u_1)
0   0     1/2 + a
0   1     1/2 - a                                   (5.56)

y   x   u_{k+1}   P(y | x, u_{k+1})
0   0   0         1/2 + b
0   0   1         1/2 - b
0   1   0         1/2
0   1   1         1/2                               (5.57)

z_1   u_1   u_2   P(z_1 | u_1, u_2)
0     0     0     1/2 + c
0     0     1     1/2 - c
0     1     0     1/2 + d
0     1     1     1/2 - d                           (5.58)

z_i   u_i   u_{i+1}   P(z_i | u_i, u_{i+1}),  i = 2, ..., k
0     0     0         1/2 + e
0     0     1         1/2 - e
0     1     0         1/2 - e
0     1     1         1/2 + e                       (5.59)

Substituting (5.55) in (5.52), Eq. (5.54) becomes

$$\frac{1}{2} = \sum_{u} P(x \mid u_1)\, P(y \mid x, u_{k+1})\, P(z_1 \mid u_1, u_2) \cdots P(z_k \mid u_k, u_{k+1}). \tag{5.60}$$
Next, we prove that if Eq. (5.60) is satisfied for $x = 0, y = 0, z_1 = 0, \ldots, z_k = 0$, then it is satisfied for all possible values of $x, y, z_1, \ldots, z_k$. For any $a, b, c, d, e$, the parameterization given in Eqs. (5.55)-(5.59) satisfies the following properties:

$$\sum_{u_1} P(x \mid u_1) = 1. \tag{5.61}$$

$$\sum_{u_{k+1}} P(y \mid x, u_{k+1}) = 1. \tag{5.62}$$

$$\sum_{u_{i+1}} P(z_i \mid u_i, u_{i+1}) = 1, \quad i = 1, \ldots, k. \tag{5.63}$$

$$\sum_{u_i} P(z_i \mid u_i, u_{i+1}) = 1, \quad i = 2, \ldots, k. \tag{5.64}$$


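(a) For $x = 1$ and any values of $y, z_1, \ldots, z_k$, Eq. (5.60) is satisfied:

$$\sum_{u} P(x = 1 \mid u_1)\, P(y \mid x = 1, u_{k+1})\, P(z_1 \mid u_1, u_2) \cdots P(z_k \mid u_k, u_{k+1})$$
$$= \frac{1}{2} \sum_{u} P(x = 1 \mid u_1)\, P(z_1 \mid u_1, u_2) \cdots P(z_k \mid u_k, u_{k+1}) \quad (\text{by } P(y \mid x = 1, u_{k+1}) = 1/2)$$
$$= \frac{1}{2} \quad (\text{by Eqs. (5.63) and (5.61)}) \tag{5.65}$$

(b) If for a particular set of values $x, y, z_1, \ldots, z_k$, Eq. (5.60) is satisfied, then for the set of values $x, 1 - y, z_1, \ldots, z_k$, Eq. (5.60) is also satisfied:

$$\sum_{u} P(x \mid u_1)\, P(1 - y \mid x, u_{k+1})\, P(z_1 \mid u_1, u_2)\, P(z_2 \mid u_2, u_3) \cdots P(z_k \mid u_k, u_{k+1})$$
$$= \sum_{u} P(x \mid u_1)\, \bigl(1 - P(y \mid x, u_{k+1})\bigr)\, P(z_1 \mid u_1, u_2)\, P(z_2 \mid u_2, u_3) \cdots P(z_k \mid u_k, u_{k+1})$$
$$= \sum_{u} P(x \mid u_1)\, P(z_1 \mid u_1, u_2)\, P(z_2 \mid u_2, u_3) \cdots P(z_k \mid u_k, u_{k+1}) - \frac{1}{2} \quad (\text{by Eq. (5.60)})$$
$$= 1 - \frac{1}{2} \quad (\text{by Eqs. (5.63) and (5.61)})$$
$$= \frac{1}{2} \tag{5.66}$$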
(c) If for a particular set of values $x, y, z_1, \ldots, z_k$, Eq. (5.60) is satisfied, then for the set of values $x, y, z_1, \ldots, z_{i-1}, 1 - z_i, z_{i+1}, \ldots, z_k$, for $i = 1, \ldots, k$, Eq. (5.60) is satisfied as well:

$$\sum_{u} P(x \mid u_1)\, P(y \mid x, u_{k+1})\, P(z_1 \mid u_1, u_2) \cdots P(1 - z_i \mid u_i, u_{i+1}) \cdots P(z_k \mid u_k, u_{k+1})$$
$$= \sum_{u} P(x \mid u_1)\, P(y \mid x, u_{k+1})\, P(z_1 \mid u_1, u_2) \cdots P(z_{i-1} \mid u_{i-1}, u_i)\, P(z_{i+1} \mid u_{i+1}, u_{i+2}) \cdots P(z_k \mid u_k, u_{k+1}) - \frac{1}{2} \quad (\text{by Eq. (5.60)})$$
$$= 1 - \frac{1}{2} \quad (\text{by Eqs. (5.61)-(5.64)})$$
$$= \frac{1}{2} \tag{5.67}$$

From (a), (b), and (c), we obtain that if Eq. (5.60) is satisfied for $x = 0, y = 0, z_1 = 0, \ldots, z_k = 0$, then it is satisfied for all possible values of $x, y, z_1, \ldots, z_k$.

Next, we substitute the conditional probabilities given in Eqs. (5.55)-(5.59) into Eq. (5.60) for $x = 0, y = 0, z_1 = 0, \ldots, z_k = 0$. Define

$$f_{u_2, u_{k+1}} = \sum_{u_3, \ldots, u_k} P(z_2 = 0 \mid u_2, u_3) \cdots P(z_k = 0 \mid u_k, u_{k+1}). \tag{5.68}$$

We obtain

$$f_{00} = (1/2 + e)^{k-1} + \binom{k-1}{2} (1/2 + e)^{k-3} (1/2 - e)^2 + \binom{k-1}{4} (1/2 + e)^{k-5} (1/2 - e)^4 + \cdots = \sum_{i=0}^{i < k/2} \binom{k-1}{2i} (1/2 + e)^{k-1-2i} (1/2 - e)^{2i}. \tag{5.69}$$

From Eq. (5.64), we have

$$\sum_{u_2} f_{u_2, u_{k+1}} = 1. \tag{5.70}$$

From Eq. (5.63), we have

$$\sum_{u_{k+1}} f_{u_2, u_{k+1}} = 1. \tag{5.71}$$
Let $f = f_{00} - 1/2$; then $f_{u_2, u_{k+1}}$ is given as

u_2   u_{k+1}   f_{u_2, u_{k+1}}
0     0         1/2 + f
0     1         1/2 - f
1     0         1/2 - f
1     1         1/2 + f

Therefore, for $x = 0, y = 0, z_1 = 0, \ldots, z_k = 0$, Eq. (5.60) becomes

$$\frac{1}{2} = \sum_{u_1, u_{k+1}, u_2} P(x = 0 \mid u_1)\, P(y = 0 \mid x = 0, u_{k+1})\, P(z_1 = 0 \mid u_1, u_2)\, f_{u_2, u_{k+1}}$$
$$= (1/2 + a)(1/2 + b)\,[(1/2 + c)(1/2 + f) + (1/2 - c)(1/2 - f)]$$
$$+ (1/2 + a)(1/2 - b)\,[(1/2 + c)(1/2 - f) + (1/2 - c)(1/2 + f)]$$
$$+ (1/2 - a)(1/2 + b)\,[(1/2 + d)(1/2 + f) + (1/2 - d)(1/2 - f)]$$
$$+ (1/2 - a)(1/2 - b)\,[(1/2 + d)(1/2 - f) + (1/2 - d)(1/2 + f)]$$
$$= 1/2 + 2bf(c + d + 2ac - 2ad) \tag{5.72}$$

which leads to

$$bf(c + d + 2ac - 2ad) = 0. \tag{5.73}$$

$P_x(y, z_1, \ldots, z_k)$ is computed as

$$P_{x=0}(y = 0, z_1 = 0, \ldots, z_k = 0) = \frac{1}{2^{k+1}} \sum_{u_1, u_{k+1}, u_2} P(y = 0 \mid x = 0, u_{k+1})\, P(z_1 = 0 \mid u_1, u_2)\, f_{u_2, u_{k+1}} = \frac{1}{2^{k+1}} \bigl[1 + 4bf(c + d)\bigr] \tag{5.74}$$

Let $-1/2 < e_0 < 1/2$ be a number such that $f \neq 0$. Consider the following two models:

Model 1: $a = 1/4$, $b = 0$, $c = d = 1/4$, $e = e_0$.

Model 2: $a = 1/4$, $b = 1/4$, $c = 1/12$, $d = -1/4$, $e = e_0$.

Eq. (5.73) holds in both models, hence the two models have the same distribution $P(x, y, z_1, \ldots, z_k) = (1/2)^{k+2}$. In Model 1, $P_{x=0}(y = 0, z_1 = 0, \ldots, z_k = 0) = (1/2)^{k+1}$. In Model 2, $P_{x=0}(y = 0, z_1 = 0, \ldots, z_k = 0) = (1/2)^{k+1}(1 - f/6)$. Since $f \neq 0$, $P_{x=0}(y = 0, z_1 = 0, \ldots, z_k = 0)$ takes different values in Models 1 and 2. □
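The construction in this proof can be checked numerically. The following is a minimal brute-force sketch, not part of the original proof: it assumes the parameter tables (5.56)-(5.59) as reconstructed above, picks $k = 3$ and $e_0 = 0.3$ arbitrarily, and uses helper names (`cpts`, `distributions`, `f_closed_form`) that are purely illustrative.

```python
from itertools import product
from math import comb


def cpts(a, b, c, d, e):
    """Conditional probabilities of Eqs. (5.56)-(5.59) as plain functions."""
    def flip(p0, value):
        # p0 is the probability of value 0; return the probability of `value`
        return p0 if value == 0 else 1.0 - p0

    def px(x, u1):
        return flip(0.5 + a if u1 == 0 else 0.5 - a, x)

    def py(y, x, u_last):
        p0 = (0.5 + b if u_last == 0 else 0.5 - b) if x == 0 else 0.5
        return flip(p0, y)

    def pz1(z, u1, u2):
        shift = c if u1 == 0 else d
        return flip(0.5 + shift if u2 == 0 else 0.5 - shift, z)

    def pzi(z, ui, uj):
        return flip(0.5 + e if ui == uj else 0.5 - e, z)

    return px, py, pz1, pzi


def distributions(params, k):
    """Brute-force P(x, y, z) of Eq. (5.52) and P_x(y, z) of Eq. (5.53)."""
    px, py, pz1, pzi = cpts(*params)
    joint, interv = {}, {}
    for x, y, *z in product((0, 1), repeat=k + 2):
        p = q = 0.0
        for u in product((0, 1), repeat=k + 1):  # u[0] is U_1, u[k] is U_{k+1}
            w = 0.5 ** (k + 1) * py(y, x, u[k]) * pz1(z[0], u[0], u[1])
            for i in range(1, k):                # factors for Z_2, ..., Z_k
                w *= pzi(z[i], u[i], u[i + 1])
            q += w                               # X is fixed: no P(x | u_1) factor
            p += w * px(x, u[0])
        joint[(x, y, *z)] = p
        interv[(x, y, *z)] = q
    return joint, interv


def f_closed_form(e, k):
    """f = f_00 - 1/2, with f_00 taken from Eq. (5.69)."""
    f00 = sum(comb(k - 1, 2 * i) * (0.5 + e) ** (k - 1 - 2 * i) * (0.5 - e) ** (2 * i)
              for i in range((k - 1) // 2 + 1))
    return f00 - 0.5


k, e0 = 3, 0.3
model1 = (0.25, 0.0, 0.25, 0.25, e0)
model2 = (0.25, 0.25, 1 / 12, -0.25, e0)
joint1, interv1 = distributions(model1, k)
joint2, interv2 = distributions(model2, k)
zero = (0,) * (k + 2)
print("P(0,...,0):", joint1[zero], joint2[zero])
print("P_{x=0}(0,...,0):", interv1[zero], interv2[zero])
```

Under these assumptions both models should produce the uniform observational distribution of (5.54), while the interventional values should differ by the factor $(1 - f/6)$ stated in the proof.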
Figure 5.5: $P_x(y, z)$ is not identifiable but $P_x(y)$ is.

5.3  Identification  of  Px(s) 

Let $X$ be a singleton variable and $S \subseteq V$ be a set of variables. In this section, we study the problem of identifying $P_x(s)$. Clearly, whenever $P_x(v)$ is identifiable, so is $P_x(s)$. However, there are obvious cases where $P_x(v)$ is not identifiable and still $P_x(s)$ is identifiable for some subsets $S$ of $V$. The simplest such example can be seen in Figure 5.5, which is a special case of Figure 5.4 with $k = 1$. Here, variable $Z$ can be ignored in the computation of $P_x(y)$, giving $P_x(y) = P(y \mid x)$ and $P_x(z) = P(z)$, while (by Theorem 16) $P_x(y, z)$ is not identifiable. This example suggests that a criterion similar to that of Theorem 15, applicable in some subgraphs of $G$, would establish the identifiability of $P_x(s)$. We will show that $P_x(s)$ is indeed identifiable whenever the systematic removal of certain nonessential nodes from $G$ yields a subgraph in which a criterion based on Theorem 15 applies. First we give a criterion for identifying $P_x(s)$ which is a simple extension of Theorem 15.

5.3.1  A  criterion  for  identifying  Px(s) 

For any set $C \subseteq V$, let $An(C)$ denote the union of $C$ and the set of ancestors of the variables in $C$. The nonancestors of $S$ are nonessential for identifying $P_x(s)$, and we have the following lemma.

Lemma 9 $P_x(s)$ is identifiable if and only if $P_x(s)$ is identifiable in the subgraph $G_{An(S)}$.

Proof: (only if) By Lemma 8.

(if) Summing both sides of Eq. (1.5) over $V \setminus An(S)$, we have that the marginal distribution $P(an(S))$ decomposes exactly according to the graph $G_{An(S)}$. Hence if $P_x(s)$ is identifiable in $G_{An(S)}$, then it is computable from $P(an(S))$, and therefore is identifiable in $G$. □

From Lemma 9, a direct extension of Theorem 15 leads to the following criterion.

Theorem 17 $P_x(s)$ is identifiable if there is no bidirected path connecting $X$ to any of its children in $G_{An(S)}$.
Figure 5.6: A graph used in proving Proposition 1.

When the condition in Theorem 17 is satisfied, we can compute $P_x(an(S))$ by applying Theorem 17 in $G_{An(S)}$, and $P_x(s)$ can be obtained by marginalizing over $P_x(an(S))$.

This simple criterion correctly classifies all the examples treated in the literature with $X$ singleton, including those contrived by [GP95]. In fact, for $X$ and $S$ being singletons, we will show that if there is a bidirected path connecting $X$ to one of its children such that every node on the path is in $An(S)$, then none of the "back-door", "front-door", and [GP95] criteria is applicable. The criterion in [GP95] (which will be called the G-P criterion) is for identifying $P_x(y)$ with $X$ and $Y$ being singletons, and it includes the "front-door" and "back-door" criteria as special cases (see [Pea00, pp. 114-8]).
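The graphical test of Theorem 17 is easy to mechanize. The sketch below is one possible encoding, not taken from the original text: a graph is assumed to be given as a list of directed edges `(parent, child)` and a list of bidirected edges `(a, b)`, and the function names are hypothetical. It returns `True` when the sufficient condition holds and `False` when the criterion is simply not applicable.

```python
def ancestors(nodes, directed):
    """An(C): the nodes in C together with all of their ancestors."""
    result, stack = set(nodes), list(nodes)
    while stack:
        child = stack.pop()
        for parent, c in directed:
            if c == child and parent not in result:
                result.add(parent)
                stack.append(parent)
    return result


def theorem17_applies(directed, bidirected, x, s):
    """True if no bidirected path links X to one of its children in G_{An(S)}."""
    keep = ancestors(s, directed)
    if x not in keep:
        return True  # X has no children in G_{An(S)}, so the condition holds trivially
    # Nodes reachable from X through bidirected edges inside G_{An(S)}
    reach, stack = {x}, [x]
    while stack:
        v = stack.pop()
        for a, b in bidirected:
            for src, dst in ((a, b), (b, a)):
                if src == v and dst in keep and dst not in reach:
                    reach.add(dst)
                    stack.append(dst)
    children = {c for p, c in directed if p == x and c in keep}
    return not (reach & children)


# Figure 5.5, read as Figure 5.4 with k = 1: X -> Y, X <-> Z, Z <-> Y
directed = [("X", "Y")]
bidirected = [("X", "Z"), ("Z", "Y")]
print(theorem17_applies(directed, bidirected, "X", {"Y"}))       # criterion applies
print(theorem17_applies(directed, bidirected, "X", {"Y", "Z"}))  # criterion does not apply
```

For Figure 5.5 this reproduces the discussion above: the criterion applies to $S = \{Y\}$ and to $S = \{Z\}$, but not to $S = \{Y, Z\}$.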

Proposition  1  If  there  is  a  bidirected  path  connecting  X  to  one  of  its  children 
such  that  every  node  on  the  path  is  an  ancestor  of  Y ,  then  the  G-P  criterion  is 
not  applicable. 

Proof:  There  are  four  conditions  in  the  G-P  criterion,  among  which  Condition 
1  is  a  special  case  of  Condition  3,  and  Condition  2  is  trivial.  Therefore  we  only 
need  to  consider  Condition  3  and  4. 

Assume that there is a bidirected path $p$ from $X$ to its child $Y_1$ such that every node on $p$ is an ancestor of $Y$, and that there is a directed path $q$ from $Y_1$ to $Y$. We will show by contradiction that neither Condition 3 nor Condition 4 is applicable for identifying $P_x(y)$. For any set $Z$, a node will be called $Z$-active if it is in $Z$ or any of its descendants is in $Z$; otherwise it will be called $Z$-inactive.

(Condition 3) Assume that there exists a set $Z$ that blocks all back-door paths from $X$ to $Y$ so that $P_x(z)$ is identifiable.² If every internal node on $p$ is an ancestor of $X$, or if every nonancestor of $X$ on $p$ is $Z$-active, then let $W_1 = Y_1$; otherwise let $W_1$ be the $Z$-inactive nonancestor of $X$ that is closest to $X$ on $p$ (see Figure 5.6). If every internal node on the subpath $p(W_1, X)$³ is $Z$-active, then let $W_2 = X$; otherwise let $W_2$ be the $Z$-inactive node that is closest to $W_1$ on $p(W_1, X)$. From the definition of $W_1$ and $W_2$, $W_2$ must be an ancestor of $X$ (or be $X$ itself), and let $p_1$ be any directed path from $W_2$ to $X$. (i) If $W_1 \neq Y_1$, letting $p_2$ be any directed path from $W_1$ to $Y$, then from the definition of $W_1$ and $W_2$ the path $p' = (p_1(X, W_2), p(W_2, W_1), p_2(W_1, Y))$ is a back-door path from $X$ to $Y$ that is not blocked by $Z$ (see Figure 5.6), since $W_2$ is $Z$-inactive, all internal nodes on $p(W_2, W_1)$ are $Z$-active, and $W_1$ is $Z$-inactive. (ii) If $W_1 = Y_1$, there are two situations:

² A path from $X$ to $Y$ is said to be a back-door path if it contains an arrow into $X$.

³ We use $p(W_1, X)$ to represent the subpath of $p$ from $W_1$ to $X$.

Figure 5.7: Graphs used in proving Proposition 1.

(a) $Z$ consists entirely of nondescendants of $X$. Then the path $p'' = (p_1(X, W_2), p(W_2, Y_1), q(Y_1, Y))$ is a back-door path from $X$ to $Y$ that is not blocked by $Z$.

(b) $Z$ contains a variable $Y'$ on $q(Y_1, Y)$ so that $P_x(z)$ is identifiable. By the definition of $W_1$, every node on $p$ is an ancestor of $Z$. $P_x(z)$ cannot be identified by Theorem 17, and the G-P criterion is not applicable for identifying $P_x(z)$ if $Z$ contains more than one variable. If $Z$ contains only one variable $Y'$, then every node on $p$ is an ancestor of $Y'$. If $P_x(y')$ is identifiable by Condition 3 of the G-P criterion (Condition 4 is not applicable, as proved later), then from the preceding analysis there is a $Y''$ on the path $q(Y_1, Y')$ such that every node on $p$ is an ancestor of $Y''$ and $P_x(y'')$ is identifiable. By induction, in the end every node on $p$ is an ancestor of $Y_1$ and $P_x(y_1)$ is identifiable, which does not hold by the preceding analysis.

(Condition 4) Assume that there exist sets $Z_1$ and $Z_2$ that satisfy all of conditions (i)-(iv) in Condition 4. Since $Z_1$ has to block the path $((X, Y_1), q(Y_1, Y))$, let $V_1$ be the variable in $Z_1$ that is closest to $Y_1$ on the path $q$ (see Figure 5.7(a)). If none of the internal nodes on $p$ is in $An(V_1) \setminus An(X)$ (the set of ancestors of $V_1$ that are not ancestors of $X$), or if every variable in $An(V_1) \setminus An(X)$ on $p$ is $Z_2$-active, then let $W_1 = Y_1$; otherwise let $W_1$ be the $Z_2$-inactive variable in $An(V_1) \setminus An(X)$ that is closest to $X$ on $p$. Let $p_1$ be any directed path from $W_1$ to $V_1$. If every internal node on the subpath $p(W_1, X)$ is $Z_2$-active, then let $W_2 = X$; otherwise let $W_2$ be the $Z_2$-inactive node that is closest to $W_1$ on $p(W_1, X)$. Since $W_2$ must be an ancestor of $Y$, from the definition of $W_1$ and $W_2$, there are two possible situations:

(a) $W_2$ is an ancestor of $X$ or $W_2 = X$. Let $p_2$ be any directed path from $W_2$ to $X$ (see Figure 5.7(a)). From the definition of $W_1$ and $W_2$, the path $p' = (p_2(X, W_2), p(W_2, W_1), p_1(W_1, V_1))$ is a back-door path from $X$ to $V_1 \in Z_1$ that is not blocked by $Z_2$, which does not contain any descendant of $X$ (see Figure 5.7(a)).

(b) $W_2$ is an ancestor of $Y$ but not an ancestor of $V_1$ ($W_2 \in An(Y) \setminus An(V_1)$). Let $p_3$ be any directed path from $W_2$ to $Y$ (see Figure 5.7(b)). From the definition of $W_1$ and $W_2$, the path $p'' = (p_1(V_1, W_1), p(W_1, W_2), p_3(W_2, Y))$ is a back-door path from $V_1 \in Z_1$ to $Y$ that is not blocked by $Z_2$ (see Figure 5.7(b)). □

Figure 5.8: Subgraphs of $G$ used in computing $P_x(y)$.

However,  the  criterion  in  Theorem  17  is  not  necessary  for  identifying  Px(s).  In 
the  next  section,  we  give  an  example  in  which  Px(s)  is  identifiable  but  Theorem  17 
is  not  applicable,  and  the  process  of  computing  Px(s)  will  give  us  hints  on  how 
to  improve  the  criterion. 

5.3.2  An  example 

To illustrate the general process of computing $P_x(s)$ making use of the factorization of $P(v)$ into c-factors, we work out an example in this section. Consider the problem of identifying $P_x(y)$ in Figure 5.8(a). Theorem 17 is not applicable, but we will show that $P_x(y)$ is identifiable. Let $V = \{X, Z, Y, W_1, W_2\}$ and $V' = \{Z, Y, W_1, W_2\}$. $V$ is partitioned into three c-components: $S^X = \{X, Z, W_1\}$, $\{W_2\}$, and $\{Y\}$. $P(v)$ can be decomposed into

$$P(v) = P(w_2 \mid w_1)\, P(y \mid z)\, Q[S^X] \tag{5.75}$$

where

$$Q[S^X] = \sum_{u_1, u_2} P(x \mid w_2, u_1)\, P(w_1 \mid u_1, u_2)\, P(z \mid x, u_2)\, P(u_1)\, P(u_2) \tag{5.76}$$
$$= P(v) / \bigl(P(w_2 \mid w_1)\, P(y \mid z)\bigr) = P(z, x \mid w_2, w_1)\, P(w_1). \tag{5.77}$$

$P_x(v')$ is decomposed into

$$P_x(v') = Q[V'] = P(w_2 \mid w_1)\, P(y \mid z) \sum_{u_1, u_2} P(w_1 \mid u_1, u_2)\, P(z \mid x, u_2)\, P(u_1)\, P(u_2). \tag{5.78}$$

We want to compute $P_x(y)$:

$$P_x(y) = \sum_{z, w_1, w_2} P_x(v') = \sum_{z, w_1, w_2} Q[V']$$
$$= \sum_{z, w_1} P(y \mid z) \sum_{u_1, u_2} P(w_1 \mid u_1, u_2)\, P(z \mid x, u_2)\, P(u_1)\, P(u_2) \quad \Bigl(\sum_{w_2} P(w_2 \mid w_1) = 1\Bigr)$$
$$= \sum_{z} P(y \mid z) \sum_{u_1, u_2} P(z \mid x, u_2)\, P(u_1)\, P(u_2) \quad \Bigl(\sum_{w_1} P(w_1 \mid u_1, u_2) = 1\Bigr)$$
$$= \sum_{z} P(y \mid z)\, Q[\{Z\}]. \tag{5.79}$$

Note that the key reason for the factors of $W_1$ and $W_2$ to be summed out is that $Q[V']$ factorizes according to the subgraph $G_{V'}$ and that $W_1$ and $W_2$ are not ancestors of $Y$ in $G_{V'}$ (see Figure 5.8(b)). The problem of computing $P_x(y)$ is then reduced to computing $Q[\{Z\}]$, which may be computed from $Q[S^X]$. Again, noticing that $W_1$ is not an ancestor of $Z$ in $G_{S^X}$ (see Figure 5.8(c)), we sum $W_1$ over Eq. (5.76):


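$$\sum_{w_1} Q[S^X] = Q[\{X, Z\}] \tag{5.80}$$
$$= \sum_{u_1, u_2} P(x \mid w_2, u_1)\, P(z \mid x, u_2)\, P(u_1)\, P(u_2) \tag{5.81}$$
$$= \Bigl(\sum_{u_1} P(x \mid w_2, u_1)\, P(u_1)\Bigr) \Bigl(\sum_{u_2} P(z \mid x, u_2)\, P(u_2)\Bigr) \tag{5.82}$$
$$= Q[\{X\}]\, Q[\{Z\}] \tag{5.83}$$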
To compute $Q[\{X\}]$ and $Q[\{Z\}]$, summing $Z$ over Eq. (5.82), we obtain

$$Q[\{X\}] = \sum_{z, w_1} Q[S^X] = \sum_{w_1} P(x \mid w_2, w_1)\, P(w_1), \tag{5.84}$$

and from Eq. (5.83)

$$Q[\{Z\}] = \frac{\sum_{w_1} Q[S^X]}{Q[\{X\}]} = \frac{\sum_{w_1} P(z, x \mid w_2, w_1)\, P(w_1)}{\sum_{w_1} P(x \mid w_2, w_1)\, P(w_1)}. \tag{5.85}$$

Finally, substituting the expression for $Q[\{Z\}]$ (5.85) into Eq. (5.79), we obtain

$$P_x(y) = \sum_{z} P(y \mid z)\, \frac{\sum_{w_1} P(z, x \mid w_2, w_1)\, P(w_1)}{\sum_{w_1} P(x \mid w_2, w_1)\, P(w_1)}. \tag{5.86}$$
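As a sanity check, the expression in Eq. (5.86) can be compared against the interventional distribution of a concrete model. The sketch below is illustrative only: it assumes binary variables, draws random conditional probability tables for the structure read off Eq. (5.76) ($W_1 \to W_2 \to X \to Z \to Y$, with $U_1$ confounding $X$ and $W_1$, and $U_2$ confounding $W_1$ and $Z$), and all helper names are hypothetical.

```python
import random
from itertools import product

random.seed(0)
B = (0, 1)


def random_cpt(n_parents):
    """Random P(child = 1 | parents) for every binary parent configuration."""
    return {cfg: random.uniform(0.1, 0.9) for cfg in product(B, repeat=n_parents)}


def prob(cpt, value, *parents):
    p1 = cpt[parents]
    return p1 if value == 1 else 1.0 - p1


# Hypothetical parameters: P(u1), P(u2), P(w1|u1,u2), P(w2|w1), P(x|w2,u1), P(z|x,u2), P(y|z)
pu1, pu2 = random_cpt(0), random_cpt(0)
pw1, pw2 = random_cpt(2), random_cpt(1)
px, pz, py = random_cpt(2), random_cpt(2), random_cpt(1)

# Observational joint P(w1, w2, x, z, y) with U1 and U2 summed out
NAMES = ("w1", "w2", "x", "z", "y")
P = {}
for w1, w2, x, z, y in product(B, repeat=5):
    P[w1, w2, x, z, y] = sum(
        prob(pu1, u1) * prob(pu2, u2) * prob(pw1, w1, u1, u2) * prob(pw2, w2, w1)
        * prob(px, x, w2, u1) * prob(pz, z, x, u2) * prob(py, y, z)
        for u1, u2 in product(B, repeat=2))


def marg(**fixed):
    """Marginal probability of the partial assignment given as keyword arguments."""
    return sum(p for key, p in P.items()
               if all(key[NAMES.index(n)] == v for n, v in fixed.items()))


def formula(x, y, w2):
    """Right-hand side of Eq. (5.86), computed from the observational P only."""
    total = 0.0
    for z in B:
        p_y_given_z = marg(y=y, z=z) / marg(z=z)
        num = sum(marg(z=z, x=x, w2=w2, w1=w1) / marg(w2=w2, w1=w1) * marg(w1=w1)
                  for w1 in B)
        den = sum(marg(x=x, w2=w2, w1=w1) / marg(w2=w2, w1=w1) * marg(w1=w1)
                  for w1 in B)
        total += p_y_given_z * num / den
    return total


def truth(x, y):
    """P_x(y) computed directly from the model with X held fixed at x."""
    return sum(prob(py, y, z) * prob(pz, z, x, u2) * prob(pu2, u2)
               for z, u2 in product(B, repeat=2))


for x, y in product(B, repeat=2):
    print(x, y, truth(x, y), [formula(x, y, w2) for w2 in B])
```

Although $w_2$ appears on the right-hand side of (5.86), the printed values should not depend on it, which agrees with $Q[\{Z\}]$ being a function of $z$ and $x$ only in Eq. (5.82).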


From this example, we see that the quantity $Q[C]$ defined in Eq. (5.28) plays an important role in identifying $P_x(y)$. The ingredients that allowed us to compute $P_x(y)$ were (i) our ability to sum out some factors from $Q[V']$ as in Eq. (5.79), due to the fact that $W_1$ and $W_2$ are not ancestors of $Y$ in $G_{V'}$; (ii) our ability to compute $Q[\{X\}]$ and $Q[\{Z\}]$ from $Q[\{X, Z\}]$, which is due to the decomposition of $Q[\{X, Z\}]$ into the product of $Q[\{X\}]$ and $Q[\{Z\}]$ (Eq. (5.83)), because in the graph $G_{\{X, Z\}}$ (Figure 5.8(d)), $\{X, Z\}$ is partitioned into two c-components $\{X\}$ and $\{Z\}$. These two points correspond to Lemmas 2 and 3 in Chapter 4 respectively, which will be presented next.


5.3.3  Lemmas 

Lemmas 2 and 3 in Chapter 4 will be instrumental in the general computation of causal effects $P_x(s)$. The two lemmas are restated below, tailored to the situation of semi-Markovian models.

Lemma 10 Let $W \subseteq C \subseteq V$, and $W' = C \setminus W$. If $W$ is an ancestral set in the subgraph $G_C$ (that is, $An(W)_{G_C} = W$), or equivalently, if none of the parents of $W$ is in $W'$ ($Pa(W) \cap W' = \emptyset$), then

$$\sum_{w'} Q[C] = Q[W]. \tag{5.87}$$


Lemma 10 provides a condition under which summing $Q[C]$ over some variables is equivalent to removing the corresponding factors. It also provides a condition under which we can compute $Q[W]$ from $Q[C]$, where $W$ is a subset of $C$, by simply summing $Q[C]$ over the remaining variables (in $C \setminus W$), like ordinary marginalization in probability theory.
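The graphical condition of Lemma 10 amounts to a one-line check. The following sketch is an illustration under an assumed edge-list encoding (`(parent, child)` pairs); the function name `is_ancestral` and the edge list, read off Eqs. (5.75)-(5.76) for Figure 5.8(a), are not part of the original text.

```python
def is_ancestral(w, c, directed):
    """True if no parent of W (inside G_C) lies in W' = C minus W."""
    w, c = set(w), set(c)
    rest = c - w
    return not any(parent in rest and child in w for parent, child in directed)


# Directed edges of Figure 5.8(a), as read off Eqs. (5.75)-(5.76)
directed = [("W1", "W2"), ("W2", "X"), ("X", "Z"), ("Z", "Y")]
s_x = {"X", "Z", "W1"}

print(is_ancestral({"X", "Z"}, s_x, directed))  # W1 may be summed out of Q[S^X]
print(is_ancestral({"Z"}, s_x, directed))       # X is a parent of Z, so Lemma 10 does not apply
```

The first call corresponds to the step from Eq. (5.76) to Eq. (5.80) in the example; the second shows why $Q[\{Z\}]$ cannot be obtained from $Q[S^X]$ by summation alone.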
Lemma 11 Let $H \subseteq V$, and assume that $H$ is partitioned into c-components $H_1, \ldots, H_l$ in the subgraph $G_H$. Then we have

(i) $Q[H]$ decomposes as

$$Q[H] = \prod_{i} Q[H_i]. \tag{5.88}$$

(ii) Let $k$ be the number of variables in $H$, and let a topological order of the variables in $H$ be $V_{h_1} < \cdots < V_{h_k}$ in $G_H$. Let $H^{(i)} = \{V_{h_1}, \ldots, V_{h_i}\}$ be the set of variables in $H$ ordered before $V_{h_i}$ (including $V_{h_i}$), $i = 1, \ldots, k$, and $H^{(0)} = \emptyset$. Then each $Q[H_j]$, $j = 1, \ldots, l$, is computable from $Q[H]$ and is given by

$$Q[H_j] = \prod_{\{i \mid V_{h_i} \in H_j\}} \frac{Q[H^{(i)}]}{Q[H^{(i-1)}]}, \tag{5.89}$$

where each $Q[H^{(i)}]$, $i = 0, 1, \ldots, k$, is given by

$$Q[H^{(i)}] = \sum_{h \setminus h^{(i)}} Q[H]. \tag{5.90}$$


Lemma  11  generalizes  Lemma  7  to  proper  subgraphs  of  G. 

The use of Lemma 11 can be shown with the example studied in Section 5.3.2, where the subgraph $G_{\{X, Z\}}$ (Figure 5.8(d)) is partitioned into two c-components $\{X\}$ and $\{Z\}$, and therefore $Q[\{X\}]$ and $Q[\{Z\}]$ are both computable from $Q[\{X, Z\}]$. We can check that Eqs. (5.84) and (5.85) satisfy (5.89).

Next, we present a procedure for computing $P_x(s)$ based on Lemmas 7, 10, and 11.
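For concreteness, the check can be written out as follows. This is a worked instance added for illustration; it assumes the topological order $X < Z$ in $G_{\{X, Z\}}$, so that $H^{(0)} = \emptyset$, $H^{(1)} = \{X\}$, and $H^{(2)} = \{X, Z\}$, and it uses $Q[\{X, Z\}] = \sum_{w_1} Q[S^X]$ from Eq. (5.80).

```latex
\begin{align*}
Q[H^{(2)}] &= Q[\{X, Z\}], \qquad
Q[H^{(1)}] = \sum_{z} Q[\{X, Z\}], \qquad
Q[H^{(0)}] = \sum_{x, z} Q[\{X, Z\}] = 1, \\
Q[\{X\}] &= \frac{Q[H^{(1)}]}{Q[H^{(0)}]} = \sum_{z} Q[\{X, Z\}] = \sum_{z, w_1} Q[S^X], \\
Q[\{Z\}] &= \frac{Q[H^{(2)}]}{Q[H^{(1)}]} = \frac{Q[\{X, Z\}]}{\sum_{z} Q[\{X, Z\}]}
          = \frac{\sum_{w_1} Q[S^X]}{Q[\{X\}]}.
\end{align*}
```

The last two lines coincide with Eqs. (5.84) and (5.85); the value $Q[H^{(0)}] = 1$ follows from Eq. (5.82) by summing first over $z$ and then over $x$.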


5.3.4  Computing  Px(s) 


Let $V$ be partitioned into c-components $S^X, S_1, \ldots, S_k$, where $X \in S^X$, and let $V' = V \setminus \{X\}$. We have

$$P(v) = Q[V] = Q[S^X] \prod_{i} Q[S_i], \tag{5.91}$$

and

$$P_x(v') = Q[V'] = Q[S^X \setminus \{X\}] \prod_{i} Q[S_i]. \tag{5.92}$$

We want to compute

$$P_x(s) = \sum_{v' \setminus s} P_x(v') = \sum_{v' \setminus s} Q[V']. \tag{5.93}$$
Let $D = An(S)_{G_{V'}}$. By Lemma 10, Eq. (5.93) becomes

$$P_x(s) = \sum_{d \setminus s} \sum_{v' \setminus d} Q[V'] = \sum_{d \setminus s} Q[D]. \tag{5.94}$$

Let $D^X = D \cap S^X$, and $D_i = D \cap S_i$, $i = 1, \ldots, k$. From Eq. (5.92), $Q[D]$ can be written as

$$Q[D] = Q[D^X] \prod_{i} Q[D_i]. \tag{5.95}$$

$D_i$ is an ancestral set in $G_{S_i}$ from its definition, hence by Lemma 10,

$$Q[D_i] = \sum_{s_i \setminus d_i} Q[S_i], \quad i = 1, \ldots, k. \tag{5.96}$$

However, $D^X$ may not be an ancestral set in $G_{S^X}$ (although it is an ancestral set in $G_{S^X \setminus \{X\}}$), because $X$ could be an ancestor of $D^X$. Combining Eqs. (5.94)-(5.96), we obtain

$$P_x(s) = \sum_{d \setminus s} Q[D^X] \prod_{i} \sum_{s_i \setminus d_i} Q[S_i]. \tag{5.97}$$

Assume that in the graph $G_{D^X}$, $D^X$ is partitioned into c-components $D_1^X, \ldots, D_l^X$. Then $Q[D^X] = \prod_j Q[D_j^X]$, and we obtain

$$P_x(s) = \sum_{d \setminus s} \prod_{j} Q[D_j^X] \prod_{i} \sum_{s_i \setminus d_i} Q[S_i]. \tag{5.98}$$

Since all the c-factors $Q[S_i]$ are identifiable, we obtain that $P_x(s)$ is identifiable if all the $Q[D_j^X]$ are identifiable.

Since $D_j^X \subseteq S^X$, $Q[D_j^X]$ is identifiable if it is computable from $Q[S^X]$. Next, we study the conditions for $Q[D_j^X]$ to be computable from $Q[S^X]$. Let $F = An(D_j^X)_{G_{S^X}}$.

• If $F = D_j^X$, that is, if $D_j^X$ is an ancestral set in $G_{S^X}$, then by Lemma 10, $Q[D_j^X]$ can be computed as

$$Q[D_j^X] = \sum_{s^x \setminus d_j^x} Q[S^X]. \tag{5.99}$$

• If $F = S^X$, we are unable to determine whether $Q[D_j^X]$ is computable from $Q[S^X]$ at this moment.
Function Identify(C, T, Q)

INPUT: $C \subseteq T \subseteq V$, $Q = Q[T]$. Assuming $G_T$ is composed of one single c-component.

OUTPUT: Expression for $Q[C]$ in terms of $Q$, or fail to determine.

Let $A = An(C)_{G_T}$.

• IF $A = C$, output $Q[C] = \sum_{t \setminus c} Q$.

• IF $A = T$, output FAIL.

• IF $C \subset A \subset T$

1. Assume that in $G_A$, $C$ is contained in a c-component $T'$.

2. Compute $Q[T']$ from $Q[A] = \sum_{t \setminus a} Q$ by Lemma 11.

3. Output Identify($C$, $T'$, $Q[T']$).

Figure 5.9: A function determining if $Q[C]$ is computable from $Q[T]$.

• Assume that $D_j^X \subset F \subset S^X$. By Lemma 10, we have

$$Q[F] = \sum_{s^x \setminus f} Q[S^X]. \tag{5.100}$$

Assume that in the graph $G_F$, $D_j^X$ is contained in a c-component $H$ (the variables in $D_j^X$ are connected by bidirected paths among themselves, hence belong to the same c-component). By Lemma 11, $Q[H]$ can be computed from $Q[F]$ and thus is identifiable. The problem of whether $Q[D_j^X]$ is computable from $Q[S^X]$ is thereby reduced to that of whether $Q[D_j^X]$ is computable from $Q[H]$.

The preceding analysis gives a recursive procedure for determining whether $Q[D_j^X]$ is computable from $Q[S^X]$: at each step, we either find an expression for $Q[D_j^X]$, find the problem indeterminable, or reduce the problem to a simpler one in the sense that $H \subset S^X$. In general, let $C \subseteq T \subseteq V$; a recursive algorithm for determining if $Q[C]$ is computable from $Q[T]$ is presented in Figure 5.9.

In summary, an algorithm for computing $P_x(s)$ is given in Figure 5.10. The procedure consists of three basic phases. In phase-1, we compute the expressions for all c-factors and find (graphically) the sets $D_j^X$ from the graph $G$. In phase-2, we attempt to compute the $Q[D_j^X]$ from $Q[S^X]$ by calling the function Identify($D_j^X$, $S^X$, $Q[S^X]$) given in Figure 5.9. In phase-3, if all the $Q[D_j^X]$ are computable, we output the expression for $P_x(s)$ given in Eq. (5.98).
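The recursion of Figure 5.9 depends only on the graph, so its control flow can be sketched without manipulating the expressions themselves. The following is a graph-level illustration under an assumed edge-list encoding; it records which lemma would be applied at each step rather than building $Q[C]$, and the names `identify`, `ancestors`, and `c_component` are hypothetical.

```python
def ancestors(nodes, directed, within):
    """An(nodes) in the subgraph induced by `within` (the nodes themselves included)."""
    result, stack = set(nodes), list(nodes)
    while stack:
        child = stack.pop()
        for p, c in directed:
            if c == child and p in within and p not in result:
                result.add(p)
                stack.append(p)
    return result


def c_component(node, bidirected, within):
    """The c-component containing `node` in the subgraph induced by `within`."""
    comp, stack = {node}, [node]
    while stack:
        v = stack.pop()
        for a, b in bidirected:
            for src, dst in ((a, b), (b, a)):
                if src == v and dst in within and dst not in comp:
                    comp.add(dst)
                    stack.append(dst)
    return comp


def identify(c, t, directed, bidirected, trace=None):
    """Return the list of reduction steps for Q[C] from Q[T], or None for FAIL."""
    trace = [] if trace is None else trace
    c, t = frozenset(c), frozenset(t)
    a = frozenset(ancestors(c, directed, t))
    if a == c:                                   # Lemma 10: sum Q[T] over T \ C
        trace.append(("sum out", t - c))
        return trace
    if a == t:                                   # undetermined case
        return None
    t_new = frozenset(c_component(next(iter(c)), bidirected, a))
    assert c <= t_new, "C is assumed to lie inside one c-component of G_A"
    trace.append(("Lemma 11", a, t_new))         # Q[T'] from Q[A]
    return identify(c, t_new, directed, bidirected, trace)


# Figure 5.8(a), with edges as read off Eqs. (5.75)-(5.76)
directed = [("W1", "W2"), ("W2", "X"), ("X", "Z"), ("Z", "Y")]
bidirected = [("X", "W1"), ("W1", "Z")]
print(identify({"Z"}, {"X", "Z", "W1"}, directed, bidirected))
```

On the example of Section 5.3.2, the trace should show a single Lemma 11 step with $A = \{X, Z\}$ and $T' = \{Z\}$, matching Eqs. (5.80)-(5.85). Because $T' \subseteq A \subset T$ at every recursive call, the recursion terminates.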
Algorithm 4 (Computing $P_x(s)$)

INPUT: a set $S \subseteq V$.

OUTPUT: the expression for $P_x(s)$, or fail to determine.

Phase-1:

1. Find the c-components of $G$: $S^X, S_1, \ldots, S_k$, where $X \in S^X$.

2. Compute the c-factors $Q[S^X], Q[S_1], \ldots, Q[S_k]$ by Lemma 1.

3. Let $D = An(S)_{G_{V \setminus \{X\}}}$, $D^X = D \cap S^X$.

4. Let the c-components of $G_{D^X}$ be $D_j^X$, $j = 1, \ldots, l$.

Phase-2:

For each set $D_j^X$:

Compute $Q[D_j^X]$ from $Q[S^X]$ by calling the function Identify($D_j^X$, $S^X$, $Q[S^X]$) given in Figure 5.9. If the function returns FAIL, then stop and output FAIL.

Phase-3:

Output $P_x(s) = \sum_{d \setminus s} \prod_j Q[D_j^X] \prod_i \sum_{s_i \setminus d_i} Q[S_i]$.

Figure 5.10: An algorithm for computing $P_x(s)$
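The graphical part of Phase-1 can be sketched as follows. This is an illustration only, under an assumed edge-list encoding: it finds the sets $S^X$, $D$, $D^X$ and the c-components $D_j^X$, and does not compute any c-factor expressions. The function names are hypothetical.

```python
def components(nodes, bidirected):
    """Partition `nodes` into the c-components of the induced subgraph."""
    remaining, parts = set(nodes), []
    while remaining:
        seed = remaining.pop()
        comp, stack = {seed}, [seed]
        while stack:
            v = stack.pop()
            for a, b in bidirected:
                for src, dst in ((a, b), (b, a)):
                    if src == v and dst in remaining:
                        remaining.discard(dst)
                        comp.add(dst)
                        stack.append(dst)
        parts.append(frozenset(comp))
    return parts


def ancestors(nodes, directed, within):
    """An(nodes) in the subgraph induced by `within` (the nodes themselves included)."""
    result, stack = set(nodes), list(nodes)
    while stack:
        child = stack.pop()
        for p, c in directed:
            if c == child and p in within and p not in result:
                result.add(p)
                stack.append(p)
    return result


def phase1(vertices, directed, bidirected, x, s):
    """Return S^X, D, D^X and the c-components of G_{D^X} (graph part of Phase-1)."""
    s_x = next(part for part in components(vertices, bidirected) if x in part)
    d = ancestors(s, directed, set(vertices) - {x})   # D = An(S) in G_{V \ {X}}
    d_x = d & s_x
    return s_x, d, d_x, components(d_x, bidirected)


# Figure 5.8(a), with edges as read off Eqs. (5.75)-(5.76)
vertices = {"X", "Z", "Y", "W1", "W2"}
directed = [("W1", "W2"), ("W2", "X"), ("X", "Z"), ("Z", "Y")]
bidirected = [("X", "W1"), ("W1", "Z")]
print(phase1(vertices, directed, bidirected, "X", {"Y"}))
```

For the example of Section 5.3.2 this should give $S^X = \{X, Z, W_1\}$, $D = \{Y, Z\}$ and $D^X = \{Z\}$, so Phase-2 reduces to a single call Identify($\{Z\}$, $S^X$, $Q[S^X]$).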

From the preceding analysis, we see that the problem of identifying $P_x(s)$ is reduced to that of computing $Q[C]$ from $Q[T]$ for some sets $C \subseteq T \subseteq V$, for which we give an algorithm in Figure 5.9. Now the open problem is: Is $Q[C]$ computable from $Q[T]$ if (i) $G_C$ has only one c-component ($C$ itself), (ii) $G_T$ has only one c-component ($T$ itself), and (iii) in $G_T$, all variables in $T \setminus C$ are ancestors of $C$ ($An(C)_{G_T} = T$)?

5.3.5  Useful  graphical  criteria 

We have given a procedure for determining the identifiability of $P_x(s)$ and finding its expression (when identifiable) in Figure 5.10. Next, we give some graphical criteria based on Algorithm 4 which can be used for quickly judging the identifiability of $P_x(s)$ by looking at the causal graph $G$.

The idea lies in systematically removing certain nonessential nodes from $G$ until Theorem 17 is applicable (or no more nodes can be removed). First, Lemma 9 can be used to remove nonancestors of $S$ from $G$. Next, we show that all variables that are not in the same c-component as $X$ can be removed. To prove this conclusion, we first present a utility lemma. Let $A \subseteq B \subseteq V$. We use $Q[A]_{G_B}$ to denote the function $Q[A] = \sum_{u} \prod_{\{i \mid V_i \in A\}} P(v_i \mid pa'_i, u^i)\, P(u)$, where $PA'_i = PA_i \cap B$. The difference between $Q[A]_{G_B}$ and $Q[A] = Q[A]_{G_V}$ is that some parents of $A$ in $G$ are removed in $G_B$.

Lemma 12 Let $A \subseteq B \subseteq V$. $Q[A]$ is computable from $Q[B]$ if and only if $Q[A]_{G_B}$ is computable from $Q[B]_{G_B}$.

Proof:  (only  if)  By  Lemma  8. 

(if) Proof by contradiction. Assume that $Q[A]$ is not computable from $Q[B]$; then there exist two models, $M_1$ and $M_2$, with the same causal graph $G$, satisfying

$$Q^{M_k}[B](b, c) = \sum_{u} \prod_{\{i \mid V_i \in B\}} P^{M_k}(v_i \mid pa'_i, c_i, u^i)\, P^{M_k}(u), \quad k = 1, 2, \tag{5.101}$$

where $PA'_i = PA_i \cap B$, $C_i = PA_i \setminus B$, and $C = \cup_i C_i$, such that

$$Q^{M_1}[B](b, c) = Q^{M_2}[B](b, c) > 0, \quad \text{for all values } b, c, \tag{5.102}$$

but

$$Q^{M_1}[A](b', c') \neq Q^{M_2}[A](b', c'), \quad \text{for some particular value } b', c'. \tag{5.103}$$

$Q[B]_{G_B}$ can be written as

$$Q[B]_{G_B}(b) = \sum_{u} \prod_{\{i \mid V_i \in B\}} P(v_i \mid pa'_i, u^i)\, P(u). \tag{5.104}$$

We construct two models, $M'_1$ and $M'_2$, with the same causal graph $G_B$ as

$$P^{M'_k}(v_i \mid pa'_i, u^i) = P^{M_k}(v_i \mid pa'_i, C_i = c'_i, u^i), \quad k = 1, 2, \tag{5.105}$$

$$P^{M'_k}(u) = P^{M_k}(u), \quad k = 1, 2. \tag{5.106}$$

Then we have

$$Q^{M'_k}[B]_{G_B}(b) = Q^{M_k}[B](b, c'), \quad \text{and} \quad Q^{M'_k}[A]_{G_B}(b) = Q^{M_k}[A](b, c'), \quad k = 1, 2. \tag{5.107}$$

From Eqs. (5.107), (5.102) and (5.103), we obtain

$$Q^{M'_1}[B]_{G_B}(b) = Q^{M'_2}[B]_{G_B}(b) > 0, \quad \text{for all values } b, \tag{5.108}$$

and

$$Q^{M'_1}[A]_{G_B}(b') \neq Q^{M'_2}[A]_{G_B}(b'), \quad \text{for the value } b', \tag{5.109}$$

which says that $Q[A]_{G_B}$ is not computable from $Q[B]_{G_B}$. □

Using  Lemma  12,  we  obtain  the  following  lemma  which  reduces  the  identifiability 
problem  to  some  subgraph  of  G. 
Lemma 13 Assume that $X$ is in the c-component $S^X$, and let $D^X = An(S)_{G_{V \setminus \{X\}}} \cap S^X$. Then $P_x(s)$ is identifiable if in the graph $G_{S^X}$, $P_x(d^X)$ is identifiable.

Proof: From Eq. (5.97), $P_x(s)$ is identifiable if $Q[D^X]$ is identifiable. By Lemma 12, $Q[D^X]$ is identifiable if $Q[D^X]_{G_{S^X}}$ is identifiable. Let $E^X = (S^X \setminus D^X) \setminus \{X\}$. In $G_{S^X}$, we have

$$P_x(d^X) = \sum_{e^x} P_x(s^X \setminus \{x\}) = \sum_{e^x} Q[S^X \setminus \{X\}]_{G_{S^X}} = Q[D^X]_{G_{S^X}}, \tag{5.110}$$

where we used Lemma 10 in the last step. Hence we obtain that $P_x(s)$ is identifiable if $P_x(d^X)$ is identifiable in $G_{S^X}$. □

Lemmas 9 and 13 reduce the original problem of deciding the identifiability of $P_x(s)$ in $G$ to a (usually simpler) problem of identifying the causal effect of $X$ on a different set of variables in some subgraph of $G$. If the latter problem is not recognized to be identifiable (via Theorem 17), we can of course repeat the process and attempt to reduce it further, using Lemmas 9 and 13 alternately.⁴ Such recursive application of Lemmas 9 and 13 is illustrated in the next example.
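The alternation between Lemma 9, Theorem 17, and Lemma 13 can be sketched as a simple loop. The code below is a graph-level illustration under an assumed edge-list encoding, with hypothetical function names; it returns `True` when the reduction reaches a graph where Theorem 17 applies and `None` when no further reduction is possible, in which case nothing is concluded (the footnote's caveat about Algorithm 4 being stronger still applies).

```python
def ancestors(nodes, directed, within):
    """An(nodes) in the subgraph induced by `within` (the nodes themselves included)."""
    result, stack = set(nodes), list(nodes)
    while stack:
        child = stack.pop()
        for p, c in directed:
            if c == child and p in within and p not in result:
                result.add(p)
                stack.append(p)
    return result


def c_component(node, bidirected, within):
    """The c-component containing `node` in the subgraph induced by `within`."""
    comp, stack = {node}, [node]
    while stack:
        v = stack.pop()
        for a, b in bidirected:
            for src, dst in ((a, b), (b, a)):
                if src == v and dst in within and dst not in comp:
                    comp.add(dst)
                    stack.append(dst)
    return comp


def quick_check(vertices, directed, bidirected, x, s):
    """Alternate Lemma 9 and Lemma 13 until Theorem 17 applies (True) or we get stuck (None)."""
    v, s = set(vertices), set(s)
    while True:
        v_anc = ancestors(s, directed, v)               # Lemma 9: keep An(S) only
        if x not in v_anc:
            return True                                 # X has no children left in the graph
        s_x = c_component(x, bidirected, v_anc)
        children = {c for p, c in directed if p == x and c in v_anc}
        if not (s_x & children):
            return True                                 # Theorem 17 applies
        d_x = ancestors(s, directed, v_anc - {x}) & s_x  # Lemma 13: new target D^X in G_{S^X}
        if not d_x:
            return True                                 # no factor of S^X enters Eq. (5.97)
        if s_x == v and d_x == s:
            return None                                 # no further reduction possible
        v, s = s_x, d_x


# Figure 5.8(a), with edges as read off Eqs. (5.75)-(5.76)
vertices = {"X", "Z", "Y", "W1", "W2"}
directed = [("W1", "W2"), ("W2", "X"), ("X", "Z"), ("Z", "Y")]
bidirected = [("X", "W1"), ("W1", "Z")]
print(quick_check(vertices, directed, bidirected, "X", {"Y"}))
```

On Figure 5.8(a) the loop should succeed after one application of Lemma 13 followed by Lemma 9, while on Figure 5.5 with $S = \{Y, Z\}$ it should stop without a conclusion.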

5.3.6  An  example 

Consider the problem of identifying $P_x(y)$ in Figure 5.11(a). By Lemma 9, $P_x(y)$ is identifiable in Figure 5.11(a) if it is identifiable in Figure 5.11(b), and then, by Lemma 13, if it is identifiable in Figure 5.11(c). After applying Lemmas 9 and 13 again (see Figure 5.11(d) and (e)), the problem is finally reduced to whether $P_x(y)$ is identifiable in Figure 5.11(f), which is obviously true, and we conclude that $P_x(y)$ is identifiable in Figure 5.11(a).

We now demonstrate the use of Algorithm 4 by computing $P_x(y)$ in Figure 5.11(a).

Phase-1:

1. The whole graph is one c-component.

2. $D^X = D = An(\{Y\})_{G_{V \setminus \{X\}}} = \{Y\}$.

3. We want to compute $P_x(y) = Q[\{Y\}]$.

Phase-2:

⁴ Note that some causal effects identified by Algorithm 4 may not be identified by repeatedly using Lemmas 9 and 13, which are meant for quick judgement only.
Figure 5.11: Subgraphs of $G$ used in computing $P_x(y)$. (Recoverable panel titles: (e) $G_{T_2}$, (f) $G_{A_3}$.)
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1.  Compute  Q[{F}]  by  calling  the  function  Identify ({Y-},  V,  P(v))  in  Fig¬ 
ure  5.9.  Let  A\  =  An({Y})G  =  {X,  F,  WL,W2,  W3 ,W4}.  We  have  {F}  c 
Ai  C  V.  The  graph  GAl  (Figure  5.11(b))  has  two  c-cornponents:  Ti  = 
{X,  Y,  Wi,  W'2,  W3}  and  {W4},  and  we  have 

Q[Ai\  =  X]  P(v)  =  P(ax)  -  Q[Tl}Q[{W,}}.  (5.111) 


A  topological  sort  over  A\  is:  W3  <  W4  <  W 1  <  W2  <  X  <  Y.  By 
Lemma  11,  we  obtain 


eiwi  = 


a[{iv3» 


E 


wi,w2,x,y 


P{a  1) 


E 


WA,Wi,W2  ,x,y 


P(al) 


P(w  4^3), 


(5.112) 


and  from  (5.111), 


Q[Ti]  =  P(a1)/P(-w4|n;3) 

=  P(z,  y,  Wi,  w2\w3,  w4)P(w3) 

=  P(x,  y\wu  w2,  w3,  Wi)P(wi,w2,  w3).  (5.113) 


2.  Call  the  function  Identify({F},  Ti,  Q\T\}). 

Let  A2  =  An({Y})oTi  =  {X,Y,W1,W2}  (see  Figure  5.11(c)).  We  have 
{F}  C  A2  C  Ti.  The  graph  GA2  (Figure  5.11(d))  has  two  c-components: 
T2  —  { X ,  F,  W\}  and  {W2},  and  we  have 

Q[A2]  =  ^Q[E]  -  Q[T2]Q[{W2}).  (5.114) 


A  topological  sort  over  A2  is:  W1  <  W2  <  X  <  Y.  By  Lemma  11,  we 
obtain 


Q[{W2}] 


Q[{Wi}] 


2] 

J2w2,x,yQlA^\ 


P{w2\wi), 


and  from  (5.114)  and  (5.113), 


Q[T2]  =  J2QiTi]/p(w  2W 

W3 


=  ^2  P(x>  y\W  1)  w2,  W3,  W4)P(w3\w  1,  W2)P(wi). 

W3 


(5.115) 


(5.116) 
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3.  Call  the  function  Identify ({F},  T2,  Q[T2]).  Let  A3  =  An({Y})oT2  —  {X,  Y} 
(see  Figure  5.11(e)).  We  have  {F}  C  A3  C  T2.  The  graph  GAi  (Fig¬ 
ure  5.11(f))  has  two  c-components:  {X}  and  {F},  and  we  have 

Q[a3]  =  =  aWMin]-  (5-117) 

w  i 

The  only  admissible  order  over  A3  is:  X  <  F.  By  Lemma  11,  we  obtain 
<?[mi  =  EEopy  =  E  P(x\w1,  W2,  w3,  W4)P(w3\Wi,W2)P(wi), 

y  w  i  w  iyWs 

and 

e[{i/}]  =  (E<3[I’J)/Q[{A'}] 

Wl 

_  Y.WUW3  P(x’  V N>  W Wl)P(w 3|wi,  W2)P(w i) 

EW1iW3  ™2,  ™3,  W4)P(W3\W1,W2)P('W1) 

Phase-3: 

Finally,  we  obtain 

.  _  nr-fFll  =  y\w^w2,  w3,  w4)P{w3\wuw2)P(w1) 

XV  Y.wuw3P(x\wi,W2,l03,W4)P{w3\Wi,W2)P{w1) 

5.3.7  Galles&Pearl’s  graphical  criterion  vs.  do-calculus 

[GP95] claimed that their graphical criterion will embrace all cases where identification is verifiable by do-calculus. Here we show that their criterion is not complete in this sense. Consider the problem of identifying $P_x(z)$ in Figure 5.8(a). Neither the "back-door" nor the "front-door" criterion is applicable. The graphical criterion in [GP95] also fails because there is no set which can block all back-door paths from $X$ to $Y$. However, $P_x(z) = Q[\{Z\}]$ is identifiable and is given in Eq. (5.85). $P_x(z)$ can also be computed by do-calculus as

P{z\x)  —  P{z\x,  W\) 

=  P(z\x,  Wi) 

=  P(z\x,W2,Wi) 

_  P(z,x,w2\wi) 

P(x,w2\wi) 

_  P{z,x\w2,wi)P(wi) 

EW1  P(x\w2,Wi)P{wi) 


(5.121) 

(5.122) 

(5.123) 

(5.124) 

(5.125) 


Hence we see that the graphical criterion in [GP95] is not complete with respect to do-calculus. [GP95] may have failed to consider the possibility of removing a hat by transforming Eq. (5.123) to (5.124).


5.4  Identification of $P_t(s)$


So far, we have assumed that intervention is applied to a single variable $X$. In this section we study the problem of identifying $P_t(s)$ where $S$ and $T$ are arbitrary (disjoint) subsets of $V$. We will show that, as for identifying $P_x(s)$, the problem of identifying $P_t(s)$ is also reduced to identifying $Q[C]$ from $Q[C']$ for some sets $C \subset C'$, and we give a procedure for computing $P_t(s)$.

5.4.1  Computing  Pt(s) 

Let $T' = V \setminus T$. We want to compute

$$P_t(s) = \sum_{T' \setminus S} P_t(t') = \sum_{T' \setminus S} Q[T']. \tag{5.126}$$

Let $D = An(S)_{G_{T'}}$. Then by Lemma 10,

$$P_t(s) = \sum_{D \setminus S} \sum_{T' \setminus D} Q[T'] = \sum_{D \setminus S} Q[D]. \tag{5.127}$$

Let $V$ be partitioned into c-components $S_1, \ldots, S_k$, and let $D_i = D \cap S_i$, $i = 1, \ldots, k$. Eq. (5.127) can be rewritten as

$$P_t(s) = \sum_{D \setminus S} \prod_i Q[D_i]. \tag{5.128}$$

We obtain that $P_t(s)$ is identifiable if all $Q[D_i]$'s are identifiable. Assume that the graph $G_{D_i}$ is partitioned into c-components $D_{i1}, \ldots, D_{ik_i}$. Then

$$Q[D_i] = \prod_j Q[D_{ij}], \quad i = 1, \ldots, k. \tag{5.129}$$

We obtain

$$P_t(s) = \sum_{D \setminus S} \prod_i \prod_j Q[D_{ij}]. \tag{5.130}$$

Hence $P_t(s)$ is identifiable if all $Q[D_{ij}]$'s are identifiable. Whether $Q[D_{ij}]$ is identifiable can be determined by using the function Identify($D_{ij}$, $S_i$, $Q[S_i]$) given in Figure 5.9.

Algorithm 5 (Computing $P_t(s)$)

INPUT: two disjoint sets $S, T \subset V$.

OUTPUT: the expression for $P_t(s)$ or fail to determine.

Phase-1:

1. Find the c-components of $G$: $S_1, \ldots, S_k$.

2. Compute the c-factors $Q[S_1], \ldots, Q[S_k]$ by Lemma 7.

3. Let $D = An(S)_{G_{V \setminus T}}$, $D_i = D \cap S_i$, $i = 1, \ldots, k$.

4. Let the c-components of $G_{D_i}$ be $D_{ij}$, $j = 1, \ldots, k_i$, $i = 1, \ldots, k$.

Phase-2:

For each set $D_{ij}$:

Compute $Q[D_{ij}]$ from $Q[S_i]$ by calling the function Identify($D_{ij}$, $S_i$, $Q[S_i]$) in Figure 5.9. If the function returns FAIL, then stop and output FAIL.

Phase-3:

Output $P_t(s) = \sum_{D \setminus S} \prod_i \prod_j Q[D_{ij}]$.

Figure 5.12: An algorithm for computing $P_t(s)$

In summary, an algorithm for computing $P_t(s)$ is given in Figure 5.12. The procedure consists of three basic phases. In phase-1, we compute the expressions for all c-factors and find (graphically) the sets $D_{ij}$ from the graph $G$. In phase-2, we attempt to compute the $Q[D_{ij}]$'s by calling the function Identify($D_{ij}$, $S_i$, $Q[S_i]$) given in Figure 5.9. In phase-3, if all $Q[D_{ij}]$'s are identifiable, we output the expression for $P_t(s)$ given in Eq. (5.130).
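The graphical part of phase-1 (steps 1, 3 and 4 of Algorithm 5) can be sketched in Python as follows. This is only an illustration under stated assumptions: the graph encoding (directed edges as `(parent, child)` pairs, bidirected edges as unordered pairs), the function names, and the small front-door-style test graph are all hypothetical and not taken from the text. The c-factor computation of step 2 and the Identify function of Figure 5.9 are not reproduced.

```python
from collections import defaultdict

def c_components(nodes, bidirected):
    """Partition `nodes` into c-components (sets linked by bidirected paths)
    in the subgraph induced by `nodes`."""
    nodes = set(nodes)
    adj = defaultdict(set)
    for a, b in bidirected:
        if a in nodes and b in nodes:
            adj[a].add(b)
            adj[b].add(a)
    seen, comps = set(), []
    for v in sorted(nodes):
        if v in seen:
            continue
        comp, stack = set(), [v]
        while stack:
            u = stack.pop()
            if u not in comp:
                comp.add(u)
                stack.extend(adj[u] - comp)
        seen |= comp
        comps.append(frozenset(comp))
    return comps

def ancestors(targets, nodes, directed):
    """An(targets) in the subgraph induced by `nodes`; targets are included."""
    nodes = set(nodes)
    parents = defaultdict(set)
    for p, c in directed:
        if p in nodes and c in nodes:
            parents[c].add(p)
    result, stack = set(), [t for t in targets if t in nodes]
    while stack:
        u = stack.pop()
        if u not in result:
            result.add(u)
            stack.extend(parents[u] - result)
    return result

def phase1(nodes, directed, bidirected, S, T):
    """Return D = An(S) in G_{V\\T} and a map S_i -> [D_i1, D_i2, ...]."""
    D = ancestors(S, set(nodes) - set(T), directed)
    plan = {}
    for S_i in c_components(nodes, bidirected):
        D_i = D & S_i
        if D_i:
            plan[S_i] = c_components(D_i, bidirected)
    return D, plan

# Hypothetical test graph: X -> Z -> Y with X <-> Y.
nodes = {"X", "Z", "Y"}
directed = {("X", "Z"), ("Z", "Y")}
bidirected = {("X", "Y")}
D, plan = phase1(nodes, directed, bidirected, S={"Y"}, T={"X"})
```

For this graph, `D` is `{Z, Y}`, and each pair `(S_i, D_ij)` in `plan` is one call Identify($D_{ij}$, $S_i$, $Q[S_i]$) that phase-2 would have to make.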

5.4.2  Useful  graphical  criteria 

Next, we give some graphical criteria for quick judgement of the identifiability of $P_t(s)$ by looking at the causal graph $G$. First we give some graphical conditions for identifying $P_t(v) = P_t(v \setminus t)$, the causal effect of $T$ on all other variables in $V$. The following criterion is a corollary of Lemma 7.

Theorem 18 If there is no bidirected edge connecting variables in a set $T$ to variables not in $T$, then $P_t(v)$ is identifiable. Let a topological order over $V$ be $V_1 < \ldots < V_n$, and let $V^{(i)} = \{V_1, \ldots, V_i\}$, $i = 1, \ldots, n$, and $V^{(0)} = \emptyset$. Then $P_t(v)$ is given by

$$P_t(v \setminus t) = \prod_{\{i \,|\, V_i \in V \setminus T\}} P(v_i \,|\, pa(C_i) \setminus \{v_i\}), \tag{5.131}$$

where $C_i$ is the c-component of $G_{V^{(i)}}$ that contains $V_i$.

In general, let $T' = V \setminus T$, let $V$ be partitioned into c-components $S_1, \ldots, S_k$, and let $T_i = T \cap S_i$, $T'_i = T' \cap S_i$, $i = 1, \ldots, k$. We have

$$P_t(t') = \prod_i Q[T'_i]. \tag{5.132}$$

Hence $P_t(t')$ is identifiable if and only if each $Q[T'_i]$ is computable from $Q[S_i]$. On the other hand, we have

$$P_{t_j}(v \setminus t_j) = Q[T'_j] \prod_{i \neq j} Q[S_i]. \tag{5.133}$$

Hence $P_{t_j}(v \setminus t_j)$ is identifiable if and only if $Q[T'_j]$ is computable from $Q[S_j]$. And we obtain the following lemma.

Lemma 14 Let $V$ be partitioned into c-components $S_1, \ldots, S_k$, and let $T_i = T \cap S_i$, $i = 1, \ldots, k$. $P_t(v)$ is identifiable if and only if each $P_{t_i}(v)$, $i = 1, \ldots, k$, is identifiable.

In the subgraph $G_{S_j}$,

$$P(s_j) = Q[S_j]_{G_{S_j}}, \quad \text{and} \quad P_{t_j}(s_j \setminus t_j) = Q[T'_j]_{G_{S_j}}. \tag{5.134}$$

Hence by Lemma 12, $Q[T'_j]$ is computable from $Q[S_j]$ if and only if $P_{t_j}(s_j \setminus t_j)$ is identifiable in $G_{S_j}$, which gives the following lemma.

Lemma 15 Let $S_i$ be a c-component of $G$, and $T_i \subset S_i$. $P_{t_i}(v)$ is identifiable if and only if $P_{t_i}(s_i)$ is identifiable in the graph $G_{S_i}$.

One simple condition for $Q[T'_i]$ to be computable from $Q[S_i]$ is that $T'_i$ is an ancestral set in $G_{S_i}$, or equivalently that $T_i$ contains its own descendants in $G_{S_i}$. Under this condition, by Lemma 10,

$$Q[T'_i] = \sum_{t_i} Q[S_i]. \tag{5.135}$$


And we obtain the following theorem.

Theorem 19 Let $S_i$ be a c-component of $G$, and $T_i \subset S_i$. If the children of variables in $T_i$ are either in $T_i$ or outside of $S_i$ (i.e. $T_i$ contains its own descendants in $G_{S_i}$), then $P_{t_i}(v)$ is identifiable, and is given by

$$P_{t_i}(v \setminus t_i) = \Big(\prod_{j \neq i} Q[S_j]\Big) \sum_{t_i} Q[S_i]. \tag{5.136}$$
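The graphical condition of Theorem 19 is a purely local check on the children of $T_i$, which the following Python sketch makes explicit. The function name, the edge encoding, and both edge lists are assumptions made for illustration; the first edge list is only loosely modelled on the description of Figure 5.16 in Section 5.4.3 and is not a transcription of that figure.

```python
def theorem19_applies(T_i, S_i, directed):
    """True if every child of a variable in T_i lies in T_i or outside S_i.

    `directed` is an iterable of (parent, child) pairs; S_i is assumed to be
    a c-component of the graph and T_i a subset of it.
    """
    T_i, S_i = set(T_i), set(S_i)
    if not T_i <= S_i:
        raise ValueError("T_i must be a subset of the c-component S_i")
    return all(child in T_i or child not in S_i
               for parent, child in directed
               if parent in T_i)

S = {"X1", "X2", "Y"}

# Hypothetical graph: the children of X1 and X2 are X2, Z1 and Z2 only.
edges_ok = [("X1", "Z1"), ("Z1", "X2"), ("X1", "X2"),
            ("X1", "Z2"), ("X2", "Z2"), ("Z1", "Y"), ("Z2", "Y")]

# Hypothetical variant with a direct edge X1 -> Y, so a child of T lies in S \ T.
edges_bad = edges_ok + [("X1", "Y")]

ok = theorem19_applies({"X1", "X2"}, S, edges_ok)
bad = theorem19_applies({"X1", "X2"}, S, edges_bad)
```

When the check succeeds, Eq. (5.136) gives the expression directly; when it fails, the theorem is simply silent and the general procedure of Algorithm 5 is still available.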

Next, we give some graphical conditions for quick judgment of the identifiability of $P_t(s)$.

Lemma 16 Let $V$ be partitioned into c-components $S_1, \ldots, S_k$. Let $T_i = T \cap S_i$, $D_i = An(S)_{G_{V \setminus T}} \cap S_i$, $i = 1, \ldots, k$. Then $P_t(s)$ is identifiable if every $P_{t_i}(d_i)$ is identifiable in $G_{S_i}$ for $i = 1, \ldots, k$.

Proof: From Eq. (5.128), $P_t(s)$ is identifiable if each $Q[D_i]$ is identifiable. By Lemma 12, $Q[D_i]$ is computable from $Q[S_i]$ if $Q[D_i]_{G_{S_i}}$ is computable from $Q[S_i]_{G_{S_i}}$. Let $T'_i = S_i \setminus T_i$. In $G_{S_i}$, we have

$$P_{t_i}(d_i) = \sum_{T'_i \setminus D_i} P_{t_i}(t'_i) = \sum_{T'_i \setminus D_i} Q[T'_i]_{G_{S_i}} = Q[D_i]_{G_{S_i}}, \tag{5.137}$$

where we used Lemma 10 in the last step. Hence we obtain that $P_t(s)$ is identifiable if each $P_{t_i}(d_i)$ is identifiable in $G_{S_i}$. □


Lemma 17 Let $T_1 = T \cap An(S)$. $P_t(s)$ is identifiable if and only if $P_{t_1}(s)$ is identifiable in $G_{An(S)}$.

Proof: It is well-known that $P_t(s) = P_{t_1}(s)$. The rest of the proof is the same as that of Lemma 9. □

Lemmas 16 and 17 reduce the original problem of deciding the identifiability of $P_t(s)$ in $G$ to some (usually simpler) identifiability problems in subgraphs of $G$. They can be applied repeatedly to reduce the problems further, until neither applies or until the resulting problems are recognized to be identifiable (for example, via Theorem 17 or 19).

5.4.3  Examples 

Next, we study some examples to illustrate the use of Algorithm 5 and the graphical criteria in Section 5.4.2.

Figure 5.13: By Lemma 16, $P_{x_1x_2}(y)$ is identifiable if $P_{x_1}(y)$ is identifiable in $G_S$.

Consider the problem of identifying $P_{x_1x_2}(y)$ in Figure 5.13(a), which was studied in [PR95]. $G$ has two c-components $S = \{X_1, Z, Y\}$ and $\{X_2\}$, and $X_1$ and $X_2$ are in different c-components. Letting $C = V \setminus \{X_1, X_2\} = \{Y, Z\}$, then $An(\{Y\})_{G_C} = \{Y\} \subset S$. By Lemma 16 we have that $P_{x_1x_2}(y)$ is identifiable if $P_{x_1}(y)$ is identifiable in the subgraph $G_S$ (Figure 5.13(b)). Since the latter is true by Theorem 17, we conclude that $P_{x_1x_2}(y)$ is identifiable. Next we compute $P_{x_1x_2}(y)$. We have

$$P(v) = P(x_2 | x_1, z)\,Q[S], \tag{5.138}$$

from which we obtain

$$Q[S] = P(v)/P(x_2 | x_1, z) = P(y | x_1, x_2, z)\,P(x_1, z). \tag{5.139}$$

$P_{x_1x_2}(y)$ is computed as

$$P_{x_1x_2}(y) = \sum_z Q[\{Y, Z\}] = Q[\{Y\}], \tag{5.140}$$

which can be computed by calling Identify($\{Y\}$, $S$, $Q[S]$) in Figure 5.9. Let $A = An(\{Y\})_{G_S} = \{X_1, Y\}$. We have $\{Y\} \subset A \subset S$. The graph $G_A$ has two c-components: $\{X_1\}$ and $\{Y\}$, and we have

$$Q[A] = \sum_z Q[S] = Q[\{X_1\}]\,Q[\{Y\}]. \tag{5.141}$$

The only admissible order over $A$ is: $X_1 < Y$. By Lemma 11, we obtain

$$Q[\{X_1\}] = \sum_y \sum_z Q[S] = P(x_1), \tag{5.142}$$

and

$$Q[\{Y\}] = \sum_z Q[S] \Big/ Q[\{X_1\}] = \sum_z P(y | x_1, x_2, z)\,P(z | x_1). \tag{5.143}$$

Finally, we obtain

$$P_{x_1x_2}(y) = Q[\{Y\}] = \sum_z P(y | x_1, x_2, z)\,P(z | x_1), \tag{5.144}$$

which coincides with Eq. (4.3) of [Pea00, page 122].

Figure 5.14: By Lemma 16, $P_{x_1x_2}(y)$ is identifiable if $P_{x_2}(y)$ is identifiable.

Consider the problem of identifying $P_{x_1x_2}(y)$ in Figure 5.14, which was studied in [PR95]. $G$ has two c-components $S = \{X_2, W, Y\}$ and $\{X_1\}$, and $X_1$ and $X_2$ are in different c-components. Letting $C = V \setminus \{X_1, X_2\} = \{Y, W\}$, then $An(\{Y\})_{G_C} = \{Y\} \subset S$. By Lemma 16, $P_{x_1x_2}(y)$ is identifiable if $P_{x_2}(y)$ is identifiable in $G_S$. It is clear that $P_{x_2}(y)$ is identifiable (by Theorem 17), hence $P_{x_1x_2}(y)$ is identifiable.

Consider the problem of identifying $P_{x_1x_2}(y)$ in Figure 5.15, which was studied in [PR95]. $G$ has three c-components $\{X_1\}$, $\{Y\}$, and $S = \{X_2, Z_1, Z'_1\}$, and $X_1$ and $X_2$ are in different c-components. By Lemma 14, $P_{x_1x_2}(v)$ is identifiable if both $P_{x_1}(v)$ and $P_{x_2}(v)$ are identifiable, which is true by Theorem 15. Therefore $P_{x_1x_2}(v)$ is identifiable. Next we compute $P_{x_1x_2}(v)$. We have

$$P(v) = P(x_1 | z_1)\,P(y | x_2, z'_1)\,Q[S], \tag{5.145}$$

from which we obtain

$$Q[S] = P(v) / (P(x_1 | z_1)\,P(y | x_2, z'_1)) = P(x_2, z'_1 | x_1, z_1)\,P(z_1). \tag{5.146}$$

Figure 5.15: By Lemma 14, $P_{x_1x_2}(v)$ is identifiable if both $P_{x_1}(v)$ and $P_{x_2}(v)$ are identifiable.

$P_{x_1x_2}(v)$ is computed as

$$P_{x_1x_2}(y, z_1, z'_1) = P(y | x_2, z'_1) \sum_{x_2} Q[S] = P(y | x_2, z'_1)\,P(z'_1 | x_1, z_1)\,P(z_1) = P(y | x_2, z'_1)\,P(z_1, z'_1). \tag{5.147}$$

Figure 5.16: The problem of identifying $P_{x_1x_2}(y)$ (from [KM99]).

Next, consider the problem of identifying $P_{x_1x_2}(y)$ in Figure 5.16, which was studied in [KM99]. $X_1$ and $X_2$ are in the same c-component $S = \{X_1, X_2, Y\}$, and their children other than $X_2$ itself are not in $S$, hence Theorem 19 is applicable and $P_{x_1x_2}(y)$ is identifiable. We have

$$P(v) = P(z_1 | x_1)\,P(z_2 | x_1, x_2)\,Q[S], \tag{5.148}$$

from which we obtain

$$Q[S] = P(v) / (P(z_1 | x_1)\,P(z_2 | x_1, x_2)) = P(y | x_1, x_2, z_1, z_2)\,P(x_2 | x_1, z_1)\,P(x_1). \tag{5.149}$$

From Theorem 19, we have

$$P_{x_1x_2}(y, z_1, z_2) = P(z_1 | x_1)\,P(z_2 | x_1, x_2) \sum_{x_1, x_2} Q[S] = P(z_1 | x_1)\,P(z_2 | x_1, x_2) \sum_{x'_1, x'_2} P(y | x'_1, x'_2, z_1, z_2)\,P(x'_2 | x'_1, z_1)\,P(x'_1). \tag{5.150}$$

We further obtain

$$P_{x_1x_2}(y) = \sum_{z_1, z_2} P(z_1 | x_1)\,P(z_2 | x_1, x_2) \sum_{x'_1, x'_2} P(y | x'_1, x'_2, z_1, z_2)\,P(x'_2 | x'_1, z_1)\,P(x'_1), \tag{5.151}$$

which coincides with Eq. (3.12) of [KM99].

Figure 5.17: The problem of identifying $P_{x_1x_2}(y)$ (from [KM99]).

Consider the problem of identifying $P_{x_1x_2}(y)$ in Figure 5.17(a), which was studied in [KM99]. $X_1$ and $X_2$ are in the same c-component $S = \{X_1, X_2, Y\}$. By Lemma 15, $P_{x_1x_2}(v)$ is identifiable if $P_{x_1x_2}(y)$ is identifiable in $G_S$ (Figure 5.17(b)). Let $A = An(\{Y\})_{G_S} = \{X_1, Y\}$. By Lemma 17, $P_{x_1x_2}(y)$ is identifiable in $G_S$ if $P_{x_1}(y)$ is identifiable in the subgraph $G_A$ (Figure 5.17(c)). Since $P_{x_1}(y)$ is obviously identifiable in $G_A$, we conclude that $P_{x_1x_2}(v)$ is identifiable. We have

$$P(v) = P(z_2 | x_1, x_2)\,Q[S], \tag{5.152}$$

from which we obtain

$$Q[S] = P(v) / P(z_2 | x_1, x_2) = P(y | z_2, x_1, x_2)\,P(x_1, x_2). \tag{5.153}$$

$P_{x_1x_2}(v)$ is computed as

$$P_{x_1x_2}(z_2, y) = P(z_2 | x_1, x_2)\,Q[\{Y\}]. \tag{5.154}$$

$Q[\{Y\}]$ can be computed by calling Identify($\{Y\}$, $S$, $Q[S]$) in Figure 5.9. We have

$$Q[A] = \sum_{x_2} Q[S] = Q[\{X_1\}]\,Q[\{Y\}], \tag{5.155}$$

from which we obtain

$$Q[\{X_1\}] = \sum_y \sum_{x_2} Q[S] = P(x_1), \tag{5.156}$$

and

$$Q[\{Y\}] = \sum_{x_2} Q[S] \Big/ Q[\{X_1\}] = \sum_{x_2} P(y | z_2, x_1, x_2)\,P(x_2 | x_1). \tag{5.157}$$

Finally, substituting (5.157) into (5.154), we obtain

$$P_{x_1x_2}(z_2, y) = P(z_2 | x_1, x_2) \sum_{x'_2} P(y | z_2, x_1, x'_2)\,P(x'_2 | x_1), \tag{5.158}$$

and

$$P_{x_1x_2}(y) = \sum_{z_2} P(z_2 | x_1, x_2) \sum_{x'_2} P(y | z_2, x_1, x'_2)\,P(x'_2 | x_1), \tag{5.159}$$

which coincides with Eq. (3.21) of [KM99].

Figure 5.18: Subgraphs used in identifying $P_{x_1x_2}(w, y)$ in $G$: (a) $G$, (b) $G_S$, (c) $G_{S'}$.

In the examples studied so far, in Figures 5.13(a), 5.14, and 5.15, $P_{x_1x_2}(y)$ can be identified using the criteria given in [PR95]. In Figures 5.16 and 5.17(a), $P_{x_1x_2}(y)$ can be identified by the extended front-door criterion and the mixed-door criterion given in [KM99], respectively. Next we give an example, shown in Figure 5.18(a), for which $P_{x_1x_2}(w, y)$ is identifiable but none of the criteria in [PR95] and [KM99] is applicable. $X_1$ and $X_2$ are in the same c-component $S = \{X_1, X_2, Y\}$. By Lemma 15, $P_{x_1x_2}(v)$ is identifiable if $P_{x_1x_2}(y)$ is identifiable in $G_S$ (Figure 5.18(b)). The latter is obviously true, hence we conclude that $P_{x_1x_2}(w, y)$ is identifiable. (Formally, let $S' = An(\{Y\})_{G_S} = \{X_2, Y\}$: by Lemma 17, $P_{x_1x_2}(y)$ is identifiable in $G_S$ if $P_{x_2}(y)$ is identifiable in the subgraph $G_{S'}$ (Figure 5.18(c)), which is obvious.)

5.4.4  Identification of direct effects $P_{pa_y}(y)$

Let $Y$ be a single variable and let $V_y = V \setminus \{Y\}$ be the set of all other variables. A special case of the identifiability problem is to identify the direct effect $P_{v_y}(y)$. We have

$$P_{v_y}(y) = P_{pa_y}(y) = Q[\{Y\}]. \tag{5.160}$$

Let $Y$ be in the c-component $S^Y$. In general, the identifiability of $P_{pa_y}(y)$ can be determined by using the function Identify($\{Y\}$, $S^Y$, $Q[S^Y]$) in Figure 5.9. In this section we give some graphical criteria for determining whether $P_{pa_y}(y)$ is identifiable.

Theorem 20 If $Y$ is not connected to bidirected links, then $P_{pa_y}(y)$ is identifiable, and is given by

$$P_{pa_y}(y) = P(y | pa_y). \tag{5.161}$$

Theorem 20 is obvious. The use of Theorem 20 can be shown by identifying the direct effect on $Y$ in Figure 5.15. Theorem 20 says that $P_{x_2, z'_1}(y)$ is identifiable and is equal to $P(y | x_2, z'_1)$.

Theorem 21 Let $Y$ be in the c-component $S^Y$. If there is no bidirected path connecting $Y$ and any of its parents (i.e., $Y$ is not in the same c-component with any of its parents), then $P_{pa_y}(y)$ is identifiable, and is given by

$$P_{pa_y}(y) = \sum_{s^Y \setminus \{y\}} Q[S^Y]. \tag{5.162}$$

Proof: Since none of the variables in $S^Y \setminus \{Y\}$ is an ancestor of $Y$ in the subgraph $G_{S^Y}$, by Lemma 10, $Q[\{Y\}] = \sum_{s^Y \setminus \{y\}} Q[S^Y]$. □

We demonstrate the use of Theorem 21 by identifying the direct effect on $Y$ in Figure 5.16. $Y$ is in the c-component $S = \{X_1, X_2, Y\}$, and $Q[S]$ is given in Eq. (5.149). By Theorem 21, $P_{z_1, z_2}(y)$ is identifiable and is given by

$$P_{z_1, z_2}(y) = \sum_{x_1, x_2} P(y | x_1, x_2, z_1, z_2)\,P(x_2 | x_1, z_1)\,P(x_1). \tag{5.163}$$


Lemma 18 The direct effect on $Y$ is identifiable if and only if the direct effect on $Y$ is identifiable in $G_{An(\{Y\})}$.

Lemma 18 follows from Lemma 17.

Lemma 19 Let $Y$ be in the c-component $S^Y$. The direct effect on $Y$ is identifiable if and only if the direct effect on $Y$ is identifiable in $G_{S^Y}$.

Proof: By Lemma 12, $Q[\{Y\}]$ is computable from $Q[S^Y]$ if and only if $Q[\{Y\}]_{G_{S^Y}}$ is computable from $Q[S^Y]_{G_{S^Y}}$. □

Figure 5.19: A graph in which the direct effect on $Y$ is unidentified.

Lemmas 18 and 19 can be applied alternately to remove nodes from a graph, until it is clear that the direct effect on $Y$ is identifiable or until neither lemma is applicable. This leads to the following criterion.

Theorem 22 The direct effect on $Y$ is identifiable if there exists no subgraph $G_S$ of $G$ satisfying all of the following: (i) $Y \in S$; (ii) $G_S$ has only one c-component, $S$ itself; (iii) all variables in $S$ are ancestors of $Y$ in $G_S$.

The graph in Figure 5.19 satisfies conditions (i)-(iii), and for general graphs of such a type, we are unable to determine the identifiability of the direct effect on $Y$.
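The alternating use of Lemmas 18 and 19 can be sketched as a fixed-point loop in Python. This is an illustration under assumptions, not the dissertation's procedure verbatim: the graph encoding, the helper names, and the two test graphs are hypothetical, and the sketch reads Theorem 22 as concerning sets $S$ with more than one variable. If the loop shrinks the graph to $\{Y\}$ alone, the direct effect is identifiable; otherwise the returned set satisfies conditions (i)-(iii) and the criterion gives no answer.

```python
from collections import defaultdict

def c_component_of(y, nodes, bidirected):
    """The c-component containing y in the subgraph induced by `nodes`."""
    adj = defaultdict(set)
    for a, b in bidirected:
        if a in nodes and b in nodes:
            adj[a].add(b)
            adj[b].add(a)
    comp, stack = set(), [y]
    while stack:
        u = stack.pop()
        if u not in comp:
            comp.add(u)
            stack.extend(adj[u] - comp)
    return comp

def ancestors_of(y, nodes, directed):
    """Ancestors of y (including y) in the subgraph induced by `nodes`."""
    parents = defaultdict(set)
    for p, c in directed:
        if p in nodes and c in nodes:
            parents[c].add(p)
    anc, stack = set(), [y]
    while stack:
        u = stack.pop()
        if u not in anc:
            anc.add(u)
            stack.extend(parents[u] - anc)
    return anc

def reduce_for_direct_effect(nodes, directed, bidirected, y):
    """Alternate Lemma 18 (keep ancestors of y) and Lemma 19 (keep the
    c-component of y) until the node set stops shrinking."""
    current = set(nodes)
    while True:
        anc = ancestors_of(y, current, directed)
        comp = c_component_of(y, anc, bidirected)
        if comp == current:
            return current
        current = comp

# Hypothetical graph 1: X -> Z -> Y with X <-> Y.
front_door = reduce_for_direct_effect(
    {"X", "Z", "Y"}, {("X", "Z"), ("Z", "Y")}, {("X", "Y")}, "Y")

# Hypothetical graph 2: X -> Y with X <-> Y.
bow = reduce_for_direct_effect({"X", "Y"}, {("X", "Y")}, {("X", "Y")}, "Y")
```

In the first graph the loop ends at `{"Y"}`, so the direct effect on $Y$ is identifiable; in the second it stalls at `{"X", "Y"}`, which is a subgraph of the kind described in Theorem 22.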


5.5  Identification of $P_t(s|c)$

Let $T, S, C \subset V$. In this section, we study the problem of identifying $P_t(s|c)$. This problem is important for the identifiability of conditional plans, where action $T$ is taken in response to observation $C$ [Pea00, chapter 4].

We have

$$P_t(s|c) = \frac{P_t(s, c)}{P_t(c)}. \tag{5.164}$$

Therefore, $P_t(s|c)$ can be identified by identifying $P_t(s, c)$ and $P_t(c)$ using the method in Section 5.4. $P_t(s|c)$ is identifiable if $P_t(s, c)$ is identifiable, since $P_t(c) = \sum_s P_t(s, c)$. $P_t(s|c)$ is not identifiable if $P_t(s, c)$ is not identifiable but $P_t(c)$ is. If neither $P_t(s, c)$ nor $P_t(c)$ is identifiable, $P_t(s|c)$ may still be identifiable if the non-identifiable terms cancel out in the expressions for $P_t(s, c)$ and $P_t(c)$ computed as shown in Section 5.4.1. Next, we study conditions for this cancellation to happen.

First we compute an expression for $P_t(s, c)$ with the procedure shown in Section 5.4.1. Assume that $V$ is partitioned into c-components $S_1, \ldots, S_k$. Let $D = An(S \cup C)_{G_{V \setminus T}}$, $F = D \setminus (S \cup C)$, and let $D_i = D \cap S_i$, $i = 1, \ldots, k$. Assume that each subgraph $G_{D_i}$ is partitioned into c-components $D_{i1}, \ldots, D_{ik_i}$. Then we have (see Eq. (5.130))

$$P_t(s, c) = \sum_F \prod_{i,j} Q[D_{ij}]. \tag{5.165}$$

The identifiability of the $Q[D_{ij}]$'s can be determined by calling the function Identify($D_{ij}$, $S_i$, $Q[S_i]$) given in Figure 5.9. Let the $D_{ij}$'s be put into two sets: into $H^i$ if $Q[D_{ij}]$ is identified and into $H^n$ if it is not identified (via the function Identify(., ., .)). Eq. (5.165) can be rewritten as

$$P_t(s, c) = \sum_F \Big(\prod_{D_{ij} \in H^n} Q[D_{ij}]\Big) \Big(\prod_{D_{ij} \in H^i} Q[D_{ij}]\Big). \tag{5.166}$$

This summation over $F$ can sometimes be decomposed into a product of summations as

$$P_t(s, c) = \Big(\sum_{F_0} \prod_{D_{ij} \in H^n \cup H^0} Q[D_{ij}]\Big) \Big(\sum_{F_1} \prod_{D_{ij} \in H^1} Q[D_{ij}]\Big), \tag{5.167}$$

where $F$ is partitioned into two sets $F_0$ and $F_1$, and $H^i$ is partitioned into two sets $H^0$ and $H^1$. This partition of $F$ and $H^i$ can be determined as follows, using the fact that each $Q[D_{ij}]$ is a function of $Pa(D_{ij})$.

1. Let $F_0 = F \cap \bigcup_{D_{ij} \in H^n} Pa(D_{ij})$, $F_1 = F \setminus F_0$, and $H^1 = H^i$.

2. For each $D_{ij} \in H^1$, if $Pa(D_{ij}) \cap F_0 \neq \emptyset$, then remove $D_{ij}$ from $H^1$ and put it into $H^0$.

3. Let $G = F_1 \cap \bigcup_{D_{ij} \in H^0} Pa(D_{ij})$. If $G$ is not empty, remove the variables in $G$ from $F_1$ and put them into $F_0$. Then go back to step 2. If $G$ is empty, then stop, and the partition process is finished.
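The three partition steps above amount to a small fixed-point computation, sketched here in Python. The function name, the dictionary `Pa`, and the toy sets are assumptions for illustration only. In particular, `Pa[d]` is taken to be the full set of variables on which $Q[D_{ij}]$ depends (the set $D_{ij}$ together with its parents), which is how the sketch reads "each $Q[D_{ij}]$ is a function of $Pa(D_{ij})$".

```python
def partition(F, H_i, H_n, Pa):
    """Split F into (F0, F1) and H_i into (H0, H1) following steps 1-3.

    F    : set of variables to be summed out
    H_i  : collection of identified sets D_ij (frozensets)
    H_n  : collection of non-identified sets D_ij (frozensets)
    Pa   : dict mapping each D_ij to the variables Q[D_ij] depends on
    """
    def pa_union(sets):
        out = set()
        for d in sets:
            out |= Pa[d]
        return out

    # Step 1: variables of F that appear in some non-identified term.
    F0 = set(F) & pa_union(H_n)
    F1 = set(F) - F0
    H1, H0 = set(H_i), set()
    while True:
        # Step 2: identified terms that mention a variable of F0 move to H0.
        moved = {d for d in H1 if Pa[d] & F0}
        H1 -= moved
        H0 |= moved
        # Step 3: variables of F1 mentioned by H0 terms move to F0.
        G = F1 & pa_union(H0)
        if not G:
            return F0, F1, H0, H1
        F1 -= G
        F0 |= G

# Toy example with hypothetical sets.
N, A, B = frozenset({"n"}), frozenset({"p"}), frozenset({"q"})
Pa = {N: {"n", "a"}, A: {"p", "a", "b"}, B: {"q", "c"}}
F0, F1, H0, H1 = partition({"a", "b", "c"}, H_i={A, B}, H_n={N}, Pa=Pa)
```

Here `a` is tied to the non-identified term, which drags `A` into $H^0$ and then `b` into $F_0$; the term `B` and the variable `c` are untouched and form the factor that can be summed separately in Eq. (5.167).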

Now since $P_t(c) = \sum_S P_t(s, c)$, if none of the variables in $S$ appears in the terms in $\prod_{D_{ij} \in H^n \cup H^0} Q[D_{ij}]$, that is, if $S \cap \bigcup_{D_{ij} \in H^n \cup H^0} Pa(D_{ij}) = \emptyset$, then

$$P_t(c) = \Big(\sum_{F_0} \prod_{D_{ij} \in H^n \cup H^0} Q[D_{ij}]\Big) \sum_S \Big(\sum_{F_1} \prod_{D_{ij} \in H^1} Q[D_{ij}]\Big), \tag{5.168}$$

and $P_t(s|c)$ is identifiable as

$$P_t(s|c) = \frac{P_t(s, c)}{P_t(c)} = \frac{\sum_{F_1} \prod_{D_{ij} \in H^1} Q[D_{ij}]}{\sum_S \big(\sum_{F_1} \prod_{D_{ij} \in H^1} Q[D_{ij}]\big)}. \tag{5.169}$$

In summary, an algorithm for computing $P_t(s|c)$ is given in Figure 5.20. The procedure consists of four basic phases. In phase-1, we compute the expressions for all c-factors and find (graphically) the sets $D_{ij}$ and $F$ from the graph $G$. In phase-2, we attempt to compute the $Q[D_{ij}]$'s by calling the function Identify($D_{ij}$, $S_i$, $Q[S_i]$), and put the $D_{ij}$'s into two sets: $H^i$ if identifiable and $H^n$ if not. In phase-3, we partition $F$ into two sets and $H^i$ into two sets. In phase-4, when certain conditions are met, we output the expression for $P_t(s|c)$ given in Eq. (5.169).


5.6  Beyond  Semi-Markovian  Models 

In Sections 5.2-5.5 we have studied the identifiability problem in semi-Markovian models. Our method is based on the decomposition of $P(v)$ into a product of c-factors and on Lemmas 7, 10, and 11. Chapter 4 has shown that, in a Markovian model with arbitrary sets of unobserved variables, $P(v)$ can still be decomposed into a product of c-factors and that the properties given in Lemmas 7, 10, and 11 hold as well (see Corollary 1, Lemma 2, and Lemma 3). Therefore, we can use the same method developed in Sections 5.2-5.5 to identify causal effects in a Markovian model with arbitrary sets of unobserved variables. In fact, instead of working directly with a complicated model with arbitrary unobserved variables, we may work with its semi-Markovian projection defined in Section 4.5. It is shown in Section 4.5 that $G$ and its projection $PJ(G, V)$ have the same topological relations over $V$ and the same partition of $V$ into c-components. Based on these results, we conclude that if $P_t(s)$ is identified in $PJ(G, V)$ (using the methods in Sections 5.2-5.5), then it is identified in $G$ with the same expression.

In summary, to identify a causal effect $P_t(s)$ in a model with arbitrary unobserved variables, we first construct the projection graph $PJ(G, V)$, then attempt to compute $P_t(s)$ in $PJ(G, V)$. If $P_t(s)$ is computable in $PJ(G, V)$, then $P_t(s)$ is identifiable in $G$ with the same expression.


5.7  Conclusion 

We developed a new method for inferring causal effects based on the concept of c-component. Using the method, we established some powerful graphical criteria for ensuring the identifiability of causal effects and developed procedures that systematically identify causal effects.

Algorithm 6 (Computing $P_t(s|c)$)

INPUT: three disjoint sets $T, S, C \subset V$.

OUTPUT: the expression for $P_t(s|c)$ or fail to determine.

Phase-1:

1. Find the c-components of $G$: $S_1, \ldots, S_k$.

2. Compute the c-factors $Q[S_1], \ldots, Q[S_k]$ by Lemma 7.

3. Let $D = An(S \cup C)_{G_{V \setminus T}}$, $F = D \setminus (S \cup C)$, $D_i = D \cap S_i$, $i = 1, \ldots, k$.

4. Let the c-components of $G_{D_i}$ be $D_{ij}$, $j = 1, \ldots, k_i$, $i = 1, \ldots, k$.

Phase-2:

1. For each set $D_{ij}$: call the function Identify($D_{ij}$, $S_i$, $Q[S_i]$) in Figure 5.9. If the function returns FAIL, then put $D_{ij}$ into the set $H^n$, otherwise put $D_{ij}$ into the set $H^i$.

2. If $H^n$ is empty, then stop and output

$$P_t(s|c) = \frac{\sum_F \prod_{i,j} Q[D_{ij}]}{\sum_S \sum_F \prod_{i,j} Q[D_{ij}]}.$$

Phase-3:

1. Let $F_0 = F \cap \bigcup_{D_{ij} \in H^n} Pa(D_{ij})$, $F_1 = F \setminus F_0$, and $H^1 = H^i$.

2. For each $D_{ij} \in H^1$: if $Pa(D_{ij}) \cap F_0 \neq \emptyset$, then remove $D_{ij}$ from $H^1$ and put it into $H^0$.

3. Let $G = F_1 \cap \bigcup_{D_{ij} \in H^0} Pa(D_{ij})$. If $G \neq \emptyset$, then remove the variables in $G$ from $F_1$ and put them into $F_0$, and go back to step 2. If $G = \emptyset$, go to Phase-4.

Phase-4:

If $S \cap \bigcup_{D_{ij} \in H^n \cup H^0} Pa(D_{ij}) = \emptyset$, then output the expression for $P_t(s|c)$ as given in Eq. (5.169), otherwise output FAIL.

Figure 5.20: An algorithm for computing $P_t(s|c)$
CHAPTER 6


Identification  of  Causal  Effects  in  Linear  Models 


In Chapter 5, we studied the identification problem in nonparametric models, that is, we did not make any assumptions about the functional forms of how the variables interact with each other. In this chapter, we study the identification problem in linear models, in which we assume that all interactions among variables are linear. We will show how the identifiability results in nonparametric models can be used to identify causal effects in linear models.

6.1  Linear  Models 

A linear recursive model over a set of variables $V = \{V_1, \ldots, V_n\}$ is given by a set of linear equations

$$V_i = \sum_{j < i} c_{ij} V_j + \epsilon_i, \quad i = 1, \ldots, n, \tag{6.1}$$

where $c_{ij}$ is called a path coefficient, and $\epsilon_i$ represents an "error" term and is assumed to have normal distribution. Without loss of generality, we assume that the model is standardized as

$$E[V_i] = E[\epsilon_i] = 0, \quad i = 1, \ldots, n, \tag{6.2}$$

where $E[\cdot]$ represents the expectation.

A linear model can be represented by a DAG $G$ with bidirected links, called a causal graph, as follows. There is a directed link from $V_j$ to $V_i$ in $G$ if the coefficient of $V_j$ in the equation for $V_i$ is not zero ($c_{ij} \neq 0$). There is a bidirected link between $V_i$ and $V_j$ if the error terms $\epsilon_i$ and $\epsilon_j$ have non-zero correlation. Figure 6.1 shows a simple linear model and the corresponding causal graph in which each link is annotated by the corresponding path coefficient.

In linear models, the observed distribution $P(v)$ is fully specified by a covariance matrix $\Sigma$ over $V$. The identification problem is whether a path coefficient $c_{ij}$ is computable from the covariance $\Sigma$ given the causal graph. The problem has been under study for half a century. Some existing methods include the rank and order conditions [Fis66], the instrumental variable method [BT84], and graphical methods [McD97, Pea00, BP02]. In the next section, we show how causal effects ($P_t(s)$) are related to path coefficients, and thus provide a tool for the identification problem, extending the results in [Pea00].

$$\begin{aligned} X &= \epsilon_x \\ Z &= aX + \epsilon_z \\ W &= bZ + \epsilon_w \\ Y &= cX + dZ + eW + \epsilon_y \\ Cov(\epsilon_x, \epsilon_w) &\neq 0 \\ Cov(\epsilon_z, \epsilon_y) &\neq 0 \end{aligned}$$

Figure 6.1: A linear model.

6.2  Causal  Effects 

In linear models, we define three types of causal effects as follows. The path coefficient $c_{ij}$ quantifies the direct causal effect of $V_j$ on $V_i$, and is called a direct effect. Assume that there is a directed path $p$ from $V_k$ to $V_i$ in the causal graph $G$; then the product of path coefficients along the path $p$ is called the partial effect of $V_k$ on $V_i$ along the path $p$, and is denoted by $PE(p)$. Let $\Gamma(V_k, V_i)$ be the set of directed paths from $V_k$ to $V_i$, and let $\gamma \subset \Gamma(V_k, V_i)$. Then $\sum_{p \in \gamma} PE(p)$ is called the partial effect of $V_k$ on $V_i$ along the set of paths $\gamma$ and is denoted by $PE(\gamma)$. In particular, $PE(\Gamma(V_k, V_i))$ is called the total effect of $V_k$ on $V_i$ and is denoted by $TE(V_k, V_i)$.
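These definitions can be made concrete on the model of Figure 6.1, whose directed edges are read off its equations. In the Python sketch below the numeric values assigned to $a, b, c, d, e$ are arbitrary placeholders chosen for illustration, and the function names are hypothetical.

```python
def directed_paths(coef, src, dst):
    """Yield every directed path from src to dst as a list of nodes.

    `coef` maps (parent, child) edges to path coefficients; the graph is
    assumed to be acyclic.
    """
    def walk(node, path):
        if node == dst:
            yield path
            return
        for parent, child in coef:
            if parent == node:
                yield from walk(child, path + [child])
    yield from walk(src, [src])

def partial_effect(coef, paths):
    """PE(gamma): sum over the given paths of the product of coefficients."""
    total = 0.0
    for path in paths:
        product = 1.0
        for edge in zip(path, path[1:]):
            product *= coef[edge]
        total += product
    return total

# Edges of Figure 6.1 with placeholder values for a, b, c, d, e.
a, b, c, d, e = 0.5, 2.0, 1.0, 3.0, 0.25
coef = {("X", "Z"): a, ("Z", "W"): b,
        ("X", "Y"): c, ("Z", "Y"): d, ("W", "Y"): e}

all_paths = list(directed_paths(coef, "X", "Y"))
total_effect = partial_effect(coef, all_paths)           # c + a*d + a*b*e
indirect = partial_effect(coef, [p for p in all_paths if len(p) > 2])
```

With these placeholder values the three directed paths from $X$ to $Y$ contribute $c$, $ad$ and $abe$, so `total_effect` is $TE(X, Y)$ and `indirect` is the partial effect along the two paths that pass through $Z$.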

The direct effects, partial effects, and total effects as defined above can be computed from appropriate causal effects $P_t(s)$ by computing expectations. Let $E[\cdot | do(t)]$ denote the expectations in the post-intervention distribution $P_t(\cdot)$. The following proposition is obvious.

Proposition 2 (Total Effects) The total effect of $V_k$ on $V_i$ can be computed as

$$TE(V_k, V_i) = E[V_i | do(v_k)] / v_k. \tag{6.3}$$

Let $Pa_i = \{V_{i_1}, \ldots, V_{i_l}\}$ be the set of parents of $V_i$. We have

$$E[V_i | do(pa_i)] = \sum_j c_{i i_j} v_{i_j}, \tag{6.4}$$

from which we have the following proposition.

Proposition 3 (Direct Effects) The direct effect of $V_k$ on $V_i$ can be computed as

$$c_{ik} = \frac{\partial}{\partial v_k} E[V_i | do(pa_i)], \quad V_k \in Pa_i. \tag{6.5}$$

Let $S = \{V_{i_1}, \ldots, V_{i_m}\}$ be a set of variables that does not contain $V_i$. Let $\gamma_j$ be the set of directed paths from $V_{i_j}$ to $V_i$ that do not pass through any variables in $S \setminus \{V_{i_j}\}$. Then we have

$$E[V_i | do(s)] = \sum_j PE(\gamma_j)\,v_{i_j}, \tag{6.6}$$

where we define $PE(\emptyset) = 0$. Eq. (6.6) leads to the following proposition.

Proposition 4 (Partial Effects) Given a set of directed paths $\gamma \subset \Gamma(V_k, V_i)$, assuming that there exists a set of variables $S$ that does not contain variables lying in the paths in $\gamma$ but contains a variable lying in each path in $\Gamma(V_k, V_i) \setminus \gamma$, the partial effect $PE(\gamma)$ can be computed as

$$PE(\gamma) = \frac{\partial}{\partial v_k} E[V_i | do(s), do(v_k)]. \tag{6.7}$$

Note that such a set $S$ may not exist for some $\gamma$.


6.3  Identifying  Causal  Effects 


Next, we show how to compute those expectations with respect to post-intervention
distributions, given causal effects expressed in terms of the observed joint
$P(v)$. For two variables $V_i$ and $V_j$ and a set of variables $S$, the
coefficient of $V_j$ in the linear regression of $V_i$ on $V_j$ and $S$ is called a
partial regression coefficient, and is denoted by $\beta_{V_i V_j . S}$ (note that
the order of the subscripts in $\beta_{V_i V_j . S}$ is important). Partial
regression coefficients can be expressed in terms of covariance matrices as
follows:

$$\beta_{V_i V_j . S} = \frac{\sigma_{V_i V_j} - \Sigma_{V_i S}^{T} \Sigma_{SS}^{-1} \Sigma_{V_j S}}{\sigma_{V_j V_j} - \Sigma_{V_j S}^{T} \Sigma_{SS}^{-1} \Sigma_{V_j S}}, \tag{6.8}$$

where $\Sigma_{SS}$ etc. represent covariance matrices. Let
$S = \{V_{i_1}, \ldots, V_{i_m}\}$ and $S_j = S \setminus \{V_{i_j}\}$. We have the
following formula for conditional expectations:

$$E[V_i \mid s] = \sum_j \beta_{V_i V_{i_j} . S_j} v_{i_j}. \tag{6.9}$$
Eq. (6.9) provides the foundation for computing expectations in post-intervention
distributions expressed in terms of $P(v)$. Whenever a causal effect $P_t(s)$ is
determined to be identifiable (in a nonparametric model), we can use
Eqs. (6.3)-(6.7) to compute the causal effects in the corresponding linear model.
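
The covariance formula for partial regression coefficients and the
conditional-expectation formula of Eq. (6.9) can be sketched numerically as
follows. The covariance matrix is a made-up example for zero-mean variables, and
the function names `beta` and `cond_mean_coefs` are hypothetical. The final line
cross-checks the result against the standard normal-equation solution.

```python
import numpy as np

def beta(Sigma, i, j, S=()):
    """Partial regression coefficient of V_j in the regression of V_i on V_j and S."""
    S = list(S)
    if not S:
        return Sigma[i, j] / Sigma[j, j]
    Sss_inv = np.linalg.inv(Sigma[np.ix_(S, S)])
    s_iS, s_jS = Sigma[i, S], Sigma[j, S]
    num = Sigma[i, j] - s_iS @ Sss_inv @ s_jS
    den = Sigma[j, j] - s_jS @ Sss_inv @ s_jS
    return num / den

def cond_mean_coefs(Sigma, i, S):
    """Coefficients of v_{i_j} in E[V_i | s], one per member of S, as in Eq. (6.9)."""
    S = list(S)
    return [beta(Sigma, i, j, [k for k in S if k != j]) for j in S]

# Hypothetical covariance matrix of zero-mean variables (V_0, V_1, V_2).
Sigma = np.array([[2.0, 0.8, 0.5],
                  [0.8, 1.0, 0.3],
                  [0.5, 0.3, 1.5]])

coefs = cond_mean_coefs(Sigma, 0, [1, 2])
# Cross-check: regression coefficients from the normal equations.
ols = np.linalg.solve(Sigma[np.ix_([1, 2], [1, 2])], Sigma[[1, 2], 0])
```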

Next we study some examples. The “back-door” criterion [Pea93] says that if
a set of variables $Z$ satisfies the back-door criterion relative to $(X, Y)$,
then $P_x(y)$ is identifiable and is given by

$$P_x(y) = \sum_z P(y \mid x, z) P(z). \tag{6.10}$$

Let $Z = \{Z_1, \ldots, Z_k\}$ and $Z^i = Z \setminus \{Z_i\}$. Eq. (6.10) leads to

$$\begin{aligned}
E[Y \mid do(x)] &= \sum_z E[Y \mid x, z] P(z) \\
&= \sum_z \Big(\beta_{YX.Z}\, x + \sum_i \beta_{YZ_i.XZ^i}\, z_i\Big) P(z) \quad \text{(by Eq. (6.9))} \\
&= \beta_{YX.Z}\, x \quad (E[Z_i] = 0).
\end{aligned} \tag{6.11}$$

Therefore, by Proposition 2, if a set of variables $Z$ satisfies the back-door
criterion relative to $(X, Y)$, then the total effect of $X$ on $Y$ is given by
$\beta_{YX.Z}$. This result is given as Theorem 5.3.2 in [Pea00, p. 152].
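
The back-door result can be checked numerically on a small example. The sketch
below assumes a hypothetical three-variable linear model (Z -> X, Z -> Y,
X -> Y) with made-up coefficients and uncorrelated unit-variance errors. It
computes the population covariance implied by the structural equations and
compares the adjusted and unadjusted regression coefficients. The helpers
`sem_cov` and `beta` are illustrative names.

```python
import numpy as np

def sem_cov(C, Omega):
    """Covariance of V when V = C V + e and Cov(e) = Omega."""
    A = np.linalg.inv(np.eye(len(C)) - C)
    return A @ Omega @ A.T

def beta(Sigma, i, j, S=()):
    """Coefficient of V_j in the regression of V_i on V_j and S."""
    idx = [j, *S]
    return np.linalg.solve(Sigma[np.ix_(idx, idx)], Sigma[idx, i])[0]

Z, X, Y = 0, 1, 2
C = np.zeros((3, 3))                       # C[i, j]: coefficient of V_j in V_i's equation
C[X, Z], C[Y, Z], C[Y, X] = 0.5, 1.2, 0.7  # hypothetical path coefficients
Sigma = sem_cov(C, np.eye(3))

adjusted = beta(Sigma, Y, X, [Z])   # regression of Y on X and Z
unadjusted = beta(Sigma, Y, X)      # regression of Y on X alone
```

In this model Z satisfies the back-door criterion relative to (X, Y), so
`adjusted` recovers the coefficient on X -> Y, while `unadjusted` is biased by
the path through Z.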

Consider the “front-door” criterion [Pea95a], which says that if a set of
variables $Z$ satisfies the front-door criterion relative to $(X, Y)$, then
$P_x(y)$ is identifiable and is given by

$$P_x(y) = \sum_z P(z \mid x) \sum_{x'} P(y \mid x', z) P(x'). \tag{6.12}$$

Let $Z = \{Z_1, \ldots, Z_k\}$ and $Z^i = Z \setminus \{Z_i\}$. We have

$$\begin{aligned}
E[Y \mid do(x)] &= \sum_z P(z \mid x) \sum_{x'} E[Y \mid x', z] P(x') \\
&= \sum_z P(z \mid x) \sum_{x'} \Big(\beta_{YX.Z}\, x' + \sum_i \beta_{YZ_i.XZ^i}\, z_i\Big) P(x') \\
&= \sum_z P(z \mid x) \sum_i \beta_{YZ_i.XZ^i}\, z_i \\
&= \sum_i \beta_{YZ_i.XZ^i} E[Z_i \mid x] \\
&= \sum_i \beta_{YZ_i.XZ^i} \beta_{Z_i X}\, x.
\end{aligned} \tag{6.13}$$

Therefore, if a set of variables $Z$ satisfies the front-door criterion relative
to $(X, Y)$, then the total effect of $X$ on $Y$ is given by
$\sum_i \beta_{YZ_i.XZ^i} \beta_{Z_i X}$.
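
A numerical sketch of the front-door result follows, under assumptions that are
ours rather than the text's: a single mediator Z (X -> Z -> Y), a confounding
link between X and Y modeled as correlated error terms, and made-up coefficient
values. The helpers `sem_cov` and `beta` are illustrative names.

```python
import numpy as np

def sem_cov(C, Omega):
    """Covariance of V when V = C V + e and Cov(e) = Omega."""
    A = np.linalg.inv(np.eye(len(C)) - C)
    return A @ Omega @ A.T

def beta(Sigma, i, j, S=()):
    """Coefficient of V_j in the regression of V_i on V_j and S."""
    idx = [j, *S]
    return np.linalg.solve(Sigma[np.ix_(idx, idx)], Sigma[idx, i])[0]

X, Z, Y = 0, 1, 2
C = np.zeros((3, 3))
C[Z, X], C[Y, Z] = 0.6, 1.5        # hypothetical coefficients on X -> Z and Z -> Y
Omega = np.eye(3)
Omega[X, Y] = Omega[Y, X] = 0.4    # correlated errors stand in for the bidirected link
Sigma = sem_cov(C, Omega)

front_door = beta(Sigma, Y, Z, [X]) * beta(Sigma, Z, X)
naive = beta(Sigma, Y, X)          # confounded regression of Y on X
```

The product `front_door` equals the product of the two path coefficients along
X -> Z -> Y, which is the total effect in this model, whereas `naive` also
absorbs the error covariance.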

In general, the identifiability of $P_x(y)$ may be decided by using Theorem 17
or Algorithm 4 in Chapter 5.3. We can then identify the total effect of $X$ on
$Y$ by computing expectations using Eq. (6.9). Next we show a few examples
in which we can identify path coefficients by identifying the direct effect
$P_{pa_y}(y)$. Consider the problem of identifying direct effects on $Y$ in
Figure 5.13(a). It is shown in Chapter 5.4.3 that the direct effect
$P_{x_1 x_2}(y)$ is identifiable and is given in Eq. (5.144), restated here:

$$P_{x_1 x_2}(y) = \sum_z P(y \mid x_1, x_2, z) P(z \mid x_1). \tag{6.14}$$

We have

$$\begin{aligned}
E[Y \mid do(x_1, x_2)] &= \sum_z \big(\beta_{YX_1.X_2Z}\, x_1 + \beta_{YX_2.X_1Z}\, x_2 + \beta_{YZ.X_1X_2}\, z\big) P(z \mid x_1) \\
&= \beta_{YX_1.X_2Z}\, x_1 + \beta_{YX_2.X_1Z}\, x_2 + \beta_{YZ.X_1X_2} \beta_{ZX_1}\, x_1 \\
&= \big(\beta_{YX_1.X_2Z} + \beta_{YZ.X_1X_2} \beta_{ZX_1}\big) x_1 + \beta_{YX_2.X_1Z}\, x_2.
\end{aligned} \tag{6.15}$$

Therefore, by Proposition 3, we have that the direct effects of $X_1$ and $X_2$ on
$Y$ are both identifiable and are given by

$$c_{YX_1} = \beta_{YX_1.X_2Z} + \beta_{YZ.X_1X_2} \beta_{ZX_1}, \tag{6.16}$$

$$c_{YX_2} = \beta_{YX_2.X_1Z}. \tag{6.17}$$

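Consider the problem of identifying direct effects on $Y$ in Figure 5.16.
$P_{z_1 z_2}(y)$ is identifiable and is given in Eq. (5.163), restated here:

$$P_{z_1 z_2}(y) = \sum_{x_1, x_2} P(y \mid x_1, x_2, z_1, z_2) P(x_2 \mid x_1, z_1) P(x_1), \tag{6.18}$$

which leads to

$$\begin{aligned}
E[Y \mid do(z_1, z_2)] &= \sum_{x_1, x_2} \big(\beta_{YX_1.X_2Z_1Z_2}\, x_1 + \beta_{YX_2.X_1Z_1Z_2}\, x_2 + \beta_{YZ_1.X_1X_2Z_2}\, z_1 \\
&\qquad\quad + \beta_{YZ_2.X_1X_2Z_1}\, z_2\big) P(x_2 \mid x_1, z_1) P(x_1) \\
&= \beta_{YX_2.X_1Z_1Z_2} \sum_{x_1} \big(\beta_{X_2X_1.Z_1}\, x_1 + \beta_{X_2Z_1.X_1}\, z_1\big) P(x_1) \\
&\qquad\quad + \beta_{YZ_1.X_1X_2Z_2}\, z_1 + \beta_{YZ_2.X_1X_2Z_1}\, z_2 \\
&= \big(\beta_{YZ_1.X_1X_2Z_2} + \beta_{YX_2.X_1Z_1Z_2} \beta_{X_2Z_1.X_1}\big) z_1 + \beta_{YZ_2.X_1X_2Z_1}\, z_2.
\end{aligned} \tag{6.19}$$

Therefore we obtain that the direct effects of $Z_1$ and $Z_2$ on $Y$ are both
identifiable and are given by

$$c_{YZ_1} = \beta_{YZ_1.X_1X_2Z_2} + \beta_{YX_2.X_1Z_1Z_2} \beta_{X_2Z_1.X_1}, \tag{6.20}$$

$$c_{YZ_2} = \beta_{YZ_2.X_1X_2Z_1}. \tag{6.21}$$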

This method of translating identifiability results in nonparametric models to
linear models provides a new tool for the identification problem in linear models.
First, the method may identify some path coefficients that cannot be identified
by the instrumental variable approach. Second, the method may directly
identify some total effects and partial effects even though some individual path
coefficients involved are not identifiable, whereas the standard instrumental
variable approach focuses on the identification of individual path coefficients.


6.4  Identifying  Causal  Effects  Systematically 

For the purpose of identifying individual path coefficients, we suggest the
following systematic process. Let a topological order over $V$ be
$V_1 < \ldots < V_n$, and let $V^{(j)} = \{V_1, \ldots, V_j\}$, $j = 1, \ldots, n$.
For $j$ from 2 to $n$, at each step, we consider the subgraph $G_{V^{(j)}}$ and try
to identify the path coefficients associated with links pointing at $V_j$. At
step $j$, the causal effects involving $V_j$ can be computed as follows. Assuming
that $V_j$ is in the c-component $S_j$ of $G_{V^{(j)}}$, by Lemma 7,
$Q[S_j] = P_{v \setminus s_j}(s_j)$ is identifiable. Therefore, we can obtain some
partial effects on $V_j$ by computing $E[V_j \mid do(v \setminus s_j)]$. We may get
further information about causal effects on $V_j$ by looking for a subset $S$ of
$S_j$ that contains $V_j$ such that $Q[S]$ is identifiable. The maximum information
is achieved by finding the minimum subset $S_{min}$ of $S_j$ that contains $V_j$
such that $Q[S_{min}] = P_{v \setminus s_{min}}(s_{min})$ is identifiable and
computing $E[V_j \mid do(v \setminus s_{min})]$.

Such a minimum set can be found by slightly modifying the function
Identify(C, T, Q) in Figure 5.9, which is used to determine whether, for any set
$C \subseteq T$, $Q[C]$ is computable from $Q[T]$. The modified function
Identify_Min(S, T, Q) is given in Figure 6.2; given $Q[T]$ and a set
$S \subseteq T$, it finds the minimum set $S_{min}$ that contains $S$ such that
$Q[S_{min}]$ is computable from $Q[T]$.

Therefore, at step $j$, we call the function Identify_Min($\{V_j\}$, $S_j$,
$Q[S_j]$) to find $S_{min}$ and $Q[S_{min}] = P_{v \setminus s_{min}}(s_{min})$.
Then we compute $E[V_j \mid do(v \setminus s_{min})]$ to get some partial effects
on $V_j$. Let $Z = \{Z_1, \ldots, Z_k\}$ be the set of variables in
$V \setminus S_{min}$ such that for each $Z_i$ there exists a directed path from
$Z_i$ to $V_j$ that does not pass any other variables in $V \setminus S_{min}$. Let
$\gamma_i$ be the set of directed paths from $Z_i$ to $V_j$ that do not pass any
other variables in $V \setminus S_{min}$. By Proposition 4, the partial effect
$PE(\gamma_i)$ is identifiable and is given by

$$PE(\gamma_i) = \frac{\partial}{\partial z_i} E[V_j \mid do(v \setminus s_{min})]. \tag{6.22}$$

Function Identify_Min(S, T, Q)

INPUT: $S \subseteq T \subseteq V$, $Q = Q[T]$. Assuming $G_T$ is composed of one
single c-component.

OUTPUT: The minimum set $S_{min} \supseteq S$ such that $Q[S_{min}]$ is computable
from $Q[T]$.

Let $A = An(S)_{G_T}$.

•  IF $A = S$, output $S_{min} = S$ and $Q[S] = \sum_{t \setminus s} Q$.

•  IF $A = T$, output $S_{min} = T$ and $Q[T]$.

•  IF $S \subset A \subset T$

   1. Assume that, in $G_A$, $S$ is contained in a c-component $T'$.

   2. Compute $Q[T']$ from $Q[A] = \sum_{t \setminus a} Q$ by Lemma 11.

   3. Output Identify_Min($S$, $T'$, $Q[T']$).

Figure 6.2: A function finding the minimum set $S_{min} \supseteq S$ such that
$Q[S_{min}]$ is identifiable from $Q[T]$.

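Let the set of parents of $V_j$ be $Pa_j = \{Y_1, \ldots, Y_t\}$. Then the partial
effect $PE(\gamma_i)$, as a summation of products of path coefficients along some
paths from $Z_i$ to $V_j$, can be decomposed into

$$PE(\gamma_i) = \sum_{m:\, Y_m \in S_{min}} PE(\delta_{im})\, c_{V_j Y_m}, \quad \text{for } Z_i \notin Pa_j, \tag{6.23}$$

or, when $Z_i$ is a parent of $V_j$,

$$PE(\gamma_i) = c_{V_j Y_l} + \sum_{m:\, Y_m \in S_{min}} PE(\delta_{im})\, c_{V_j Y_m}, \quad \text{for } Z_i = Y_l, \tag{6.24}$$

where $\delta_{im}$ is the set of directed paths from $Z_i$ to $Y_m$ that do not
pass any other variables in $V \setminus S_{min}$. The summation is for
$Y_m \in S_{min}$ because $\gamma_i$ only contains paths that do not pass variables
in $V \setminus S_{min}$. Since $P_{v \setminus s_{min}}(s_{min})$ is identifiable,
by Proposition 4, the partial effect $PE(\delta_{im})$ is identifiable and is
given by

$$PE(\delta_{im}) = \frac{\partial}{\partial z_i} E[Y_m \mid do(v \setminus s_{min})], \quad \text{for } Y_m \in S_{min}, \tag{6.25}$$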
which would have been identified before step $j$. From Eqs. (6.22)-(6.25), we
conclude that, at step $j$, we will obtain a set of equations which are linear in
the set of path coefficients $c_{V_j Y_m}$ associated with links pointing at $V_j$
and in which those path coefficients are the only unknowns.

Figure 6.3: Subgraphs for identifying path coefficients in G.

In summary, at step $j$, we do the following:

1.  Find the c-component $S_j$ of $G_{V^{(j)}}$ that contains $V_j$.

2.  Find the expression for $Q[S_j]$ by Lemma 7.

3.  Call the function Identify_Min($\{V_j\}$, $S_j$, $Q[S_j]$) to find $S_{min}$
and $Q[S_{min}] = P_{v \setminus s_{min}}(s_{min})$.

4.  Compute $E[V_j \mid do(v \setminus s_{min})]$ to get a set of equations linear
in path coefficients associated with links pointing at $V_j$.

5.  Try to solve the set of linear equations.
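
The set-finding part of step 3 can be sketched as follows. This sketch returns
only the set $S_{min}$; it does not compute the expression $Q[S_{min}]$, which
requires Lemma 11. The graph encoding (lists of directed and bidirected edges)
and the function names are assumptions made for the example, and the edge lists
at the bottom are a hypothetical graph chosen to be consistent with the
derivations in this section, since the figure itself is not reproduced here.

```python
def ancestors(nodes, directed, targets):
    """Ancestors of `targets` (targets included) in the subgraph induced by `nodes`."""
    result = set(targets)
    changed = True
    while changed:
        changed = False
        for parent, child in directed:
            if child in result and parent in nodes and parent not in result:
                result.add(parent)
                changed = True
    return result

def c_component(nodes, bidirected, start):
    """C-component containing `start` in the subgraph induced by `nodes`."""
    comp, frontier = {start}, [start]
    while frontier:
        v = frontier.pop()
        for a, b in bidirected:
            for u, w in ((a, b), (b, a)):
                if u == v and w in nodes and w not in comp:
                    comp.add(w)
                    frontier.append(w)
    return comp

def identify_min_set(S, T, directed, bidirected):
    """Set-valued part of Identify_Min: returns S_min only, not Q[S_min]."""
    S, T = set(S), set(T)
    A = ancestors(T, directed, S)
    if A == S or A == T:
        return A
    # S is assumed to lie in a single c-component T' of G_A.
    T_prime = c_component(A, bidirected, next(iter(S)))
    return identify_min_set(S, T_prime, directed, bidirected)

# Hypothetical edge lists (parent, child) and bidirected pairs.
directed = [("X", "Z1"), ("Z1", "Z2"), ("Z1", "Y"), ("Z2", "Y"), ("X", "Y")]
bidirected = [("X", "Z2"), ("Z1", "Y")]

s_min_z2 = identify_min_set({"Z2"}, {"X", "Z2"}, directed, bidirected)
s_min_y = identify_min_set({"Y"}, {"Y", "Z1"}, directed, bidirected)
```

Each recursive call strictly shrinks T, so the recursion terminates on any
finite graph.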

Next, we demonstrate this procedure by some examples. Consider the
identification problem in Figure 6.3(a). The only admissible order of variables
is $X < Z_1 < Z_2 < Y < W$. At step 1, we consider the subgraph in Figure 6.3(b).

It is obvious that $P_x(z_1) = P(z_1 \mid x)$, and we obtain

$$c_{Z_1 X} = E[Z_1 \mid do(x)] / x = \beta_{Z_1 X}. \tag{6.26}$$

At step 2, we consider the subgraph in Figure 6.3(c). $Z_2$ is in the c-component
$\{X, Z_2\}$ and Lemma 7 gives

$$Q[\{X, Z_2\}] = P(z_2 \mid z_1, x) P(x). \tag{6.27}$$

Calling the function Identify_Min($\{Z_2\}$, $\{X, Z_2\}$, $Q[\{X, Z_2\}]$), we
obtain

$$Q[\{Z_2\}] = P_{z_1}(z_2) = \sum_x P(z_2 \mid z_1, x) P(x). \tag{6.28}$$

Therefore,

$$\begin{aligned}
c_{Z_2 Z_1} &= E[Z_2 \mid do(z_1)] / z_1 \\
&= \sum_x \big(\beta_{Z_2Z_1.X}\, z_1 + \beta_{Z_2X.Z_1}\, x\big) P(x) / z_1 \\
&= \beta_{Z_2Z_1.X}.
\end{aligned} \tag{6.29}$$

At step 3, we consider the subgraph in Figure 6.3(d). $Y$ is in the c-component
$\{Y, Z_1\}$ and Lemma 7 gives

$$Q[\{Y, Z_1\}] = P_{x z_2}(y, z_1) = P(y \mid z_2, z_1, x) P(z_1 \mid x). \tag{6.30}$$

Calling the function Identify_Min($\{Y\}$, $\{Y, Z_1\}$, $Q[\{Y, Z_1\}]$) returns
$\{Y, Z_1\}$ as the minimum set ($Q[\{Y\}]$ is not identifiable). We then compute
the expectation

$$\begin{aligned}
E[Y \mid do(x, z_2)] &= \sum_{z_1} \big(\beta_{YZ_2.Z_1X}\, z_2 + \beta_{YZ_1.Z_2X}\, z_1 + \beta_{YX.Z_2Z_1}\, x\big) P(z_1 \mid x) \\
&= \beta_{YZ_2.Z_1X}\, z_2 + \big(\beta_{YZ_1.Z_2X} \beta_{Z_1X} + \beta_{YX.Z_2Z_1}\big) x.
\end{aligned} \tag{6.31}$$

Therefore, we obtain the path coefficient

$$c_{YZ_2} = \beta_{YZ_2.Z_1X}, \tag{6.32}$$

and the following partial effect

$$c_{YX} + c_{Z_1X} c_{YZ_1} = \beta_{YX.Z_2Z_1} + \beta_{Z_1X} \beta_{YZ_1.Z_2X}, \tag{6.33}$$

where $c_{Z_1X}$ is identified in Eq. (6.26).
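
The identities obtained in steps 1 to 3 can be checked numerically. The sketch
below assumes a hypothetical linear model over (X, Z1, Z2, Y) whose directed
links and error correlations are chosen to be consistent with the c-components
used above, since the figure is not reproduced here. The coefficient values and
the helpers `sem_cov` and `beta` are made up for the example.

```python
import numpy as np

def sem_cov(C, Omega):
    """Covariance of V when V = C V + e and Cov(e) = Omega."""
    A = np.linalg.inv(np.eye(len(C)) - C)
    return A @ Omega @ A.T

def beta(Sigma, i, j, S=()):
    """Coefficient of V_j in the regression of V_i on V_j and S."""
    idx = [j, *S]
    return np.linalg.solve(Sigma[np.ix_(idx, idx)], Sigma[idx, i])[0]

X, Z1, Z2, Y = 0, 1, 2, 3
C = np.zeros((4, 4))                 # C[i, j]: coefficient of V_j in V_i's equation
C[Z1, X], C[Z2, Z1] = 0.8, 0.5       # hypothetical path coefficients
C[Y, Z1], C[Y, Z2], C[Y, X] = 0.6, -0.7, 0.4
Omega = np.eye(4)
Omega[X, Z2] = Omega[Z2, X] = 0.3    # assumed bidirected link between X and Z2
Omega[Z1, Y] = Omega[Y, Z1] = 0.5    # assumed bidirected link between Z1 and Y
Sigma = sem_cov(C, Omega)

c_z1x = beta(Sigma, Z1, X)                           # Eq. (6.26)
c_z2z1 = beta(Sigma, Z2, Z1, [X])                    # Eq. (6.29)
c_yz2 = beta(Sigma, Y, Z2, [Z1, X])                  # Eq. (6.32)
partial_x_on_y = (beta(Sigma, Y, X, [Z2, Z1])
                  + beta(Sigma, Z1, X) * beta(Sigma, Y, Z1, [Z2, X]))  # Eq. (6.33)
```

Under these assumptions, each computed quantity matches the corresponding
structural coefficient or combination of coefficients, even though the
individual coefficients on X -> Y and Z1 -> Y are not separately recovered at
this step.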

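Finally, at the last step, we consider the graph in Figure 6.3(a). $W$ is in the
c-component $\{W, Y, Z_1\}$ and Lemma 7 gives

$$Q[\{W, Y, Z_1\}] = P_{x z_2}(w, y, z_1) = P(w \mid y, z_2, z_1, x) P(y \mid z_2, z_1, x) P(z_1 \mid x). \tag{6.34}$$

Calling the function Identify_Min($\{W\}$, $\{W, Y, Z_1\}$, $Q[\{W, Y, Z_1\}]$)
returns $\{W, Y, Z_1\}$ as the minimum set. We then compute the expectation

$$\begin{aligned}
E[W \mid do(x, z_2)] &= \sum_{y, z_1} \big(\beta_{WY.Z_2Z_1X}\, y + \beta_{WZ_2.YZ_1X}\, z_2 + \beta_{WZ_1.YZ_2X}\, z_1 + \beta_{WX.YZ_2Z_1}\, x\big) \\
&\qquad\quad \times P(y \mid z_2, z_1, x) P(z_1 \mid x) \\
&= \beta_{WY.Z_2Z_1X} E[Y \mid do(x, z_2)] + \beta_{WZ_2.YZ_1X}\, z_2 + \beta_{WZ_1.YZ_2X} \beta_{Z_1X}\, x + \beta_{WX.YZ_2Z_1}\, x \\
&= \big(\beta_{WY.Z_2Z_1X} \beta_{YZ_2.Z_1X} + \beta_{WZ_2.YZ_1X}\big) z_2 \\
&\qquad + \big[\beta_{WY.Z_2Z_1X} \big(\beta_{YZ_1.Z_2X} \beta_{Z_1X} + \beta_{YX.Z_2Z_1}\big) + \beta_{WZ_1.YZ_2X} \beta_{Z_1X} + \beta_{WX.YZ_2Z_1}\big] x \\
&\qquad \text{(substituting (6.31))},
\end{aligned} \tag{6.35}$$

from which we obtain the total effect of $Z_2$ on $W$:

$$c_{YZ_2} c_{WY} = \beta_{WY.Z_2Z_1X} \beta_{YZ_2.Z_1X} + \beta_{WZ_2.YZ_1X}, \tag{6.36}$$

and the following partial effect of $X$ on $W$:

$$c_{WX} + (c_{YX} + c_{Z_1X} c_{YZ_1}) c_{WY} = \beta_{WY.Z_2Z_1X} \big(\beta_{YZ_1.Z_2X} \beta_{Z_1X} + \beta_{YX.Z_2Z_1}\big) + \beta_{WZ_1.YZ_2X} \beta_{Z_1X} + \beta_{WX.YZ_2Z_1}. \tag{6.37}$$

The path coefficient $c_{YZ_2}$ is identifiable and is given in (6.32), and
therefore by Eq. (6.36), the path coefficient $c_{WY}$ is identifiable and is
given by

$$c_{WY} = \beta_{WY.Z_2Z_1X} + \frac{\beta_{WZ_2.YZ_1X}}{\beta_{YZ_2.Z_1X}}. \tag{6.38}$$

Then from Eqs. (6.37), (6.33), and (6.38), the path coefficient $c_{WX}$ is
identifiable and is given by

$$c_{WX} = \beta_{WX.YZ_2Z_1} + \beta_{WZ_1.YZ_2X} \beta_{Z_1X} - \frac{\beta_{WZ_2.YZ_1X}}{\beta_{YZ_2.Z_1X}} \big(\beta_{YZ_1.Z_2X} \beta_{Z_1X} + \beta_{YX.Z_2Z_1}\big). \tag{6.39}$$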
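
Eqs. (6.38) and (6.39) can likewise be checked on a hypothetical five-variable
model. The links into W (from Y and X), the error correlations, and all numeric
values below are assumptions chosen to be consistent with the c-component
$\{W, Y, Z_1\}$ used in the derivation; they are not read off the figure.

```python
import numpy as np

def sem_cov(C, Omega):
    """Covariance of V when V = C V + e and Cov(e) = Omega."""
    A = np.linalg.inv(np.eye(len(C)) - C)
    return A @ Omega @ A.T

def beta(Sigma, i, j, S=()):
    """Coefficient of V_j in the regression of V_i on V_j and S."""
    idx = [j, *S]
    return np.linalg.solve(Sigma[np.ix_(idx, idx)], Sigma[idx, i])[0]

X, Z1, Z2, Y, W = range(5)
C = np.zeros((5, 5))                 # C[i, j]: coefficient of V_j in V_i's equation
C[Z1, X], C[Z2, Z1] = 0.8, 0.5       # hypothetical path coefficients
C[Y, Z1], C[Y, Z2], C[Y, X] = 0.6, -0.7, 0.4
C[W, Y], C[W, X] = 0.9, -0.2
Omega = np.eye(5)
for a, b, cov in [(X, Z2, 0.3), (Z1, Y, 0.5), (W, Z1, 0.3), (W, Y, 0.2)]:
    Omega[a, b] = Omega[b, a] = cov  # assumed error correlations (bidirected links)
Sigma = sem_cov(C, Omega)

b_yz2 = beta(Sigma, Y, Z2, [Z1, X])
b_wz2 = beta(Sigma, W, Z2, [Y, Z1, X])
# Eq. (6.38)
c_wy = beta(Sigma, W, Y, [Z2, Z1, X]) + b_wz2 / b_yz2
# Eq. (6.39)
c_wx = (beta(Sigma, W, X, [Y, Z2, Z1])
        + beta(Sigma, W, Z1, [Y, Z2, X]) * beta(Sigma, Z1, X)
        - (b_wz2 / b_yz2) * (beta(Sigma, Y, Z1, [Z2, X]) * beta(Sigma, Z1, X)
                             + beta(Sigma, Y, X, [Z2, Z1])))
```

The division by $\beta_{YZ_2.Z_1X}$ requires that coefficient to be nonzero,
which holds here because the assumed coefficient on Z2 -> Y is nonzero.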


6.5  Conclusion 

We show how the identifiability results in nonparametric models presented in
Chapter 5 can be used to identify causal effects in linear models. The method
may directly identify some total effects and partial effects even though some
individual path coefficients involved are not identifiable. The method is useful in
models with few bidirected (confounding) links.
CHAPTER 7


Probabilities  of  Causation:  Bounds  and 
Identification 


7.1  Introduction 

Assessing  the  likelihood  that  one  event  was  the  cause  of  another  guides  much  of 
what  we  understand  about  (and  how  we  act  in)  the  world.  For  example,  few  of  us 
would  take  aspirin  to  combat  headache  if  it  were  not  for  our  conviction  that,  with 
high  probability,  it  was  aspirin  that  “actually  caused”  relief  in  previous  headache 
episodes.  Likewise,  according  to  common  judicial  standard,  judgment  in  favor  of 
plaintiff  should  be  made  if  and  only  if  it  is  “more  probable  than  not”  that  the 
defendant’s  action  was  a  cause  for  the  plaintiff’s  injury  (or  death).  This  chapter 
deals  with  the  question  of  estimating  the  probability  of  causation  from  statistical 
data. 

Causation has two faces, necessary and sufficient. The most common conception
of causation, that the effect E would not have occurred in the absence of the
cause C, captures the notion of “necessary causation”. Competing
notions such as “sufficient cause” and “necessary-and-sufficient cause” are also
of interest in a number of applications, and this chapter analyzes the
relationships among the three notions. Although the distinction between necessary
and sufficient causes goes back to J.S. Mill [Mil43], it received semi-formal
explications only in the 1960s, via conditional probabilities [Goo61] and logical
implications [Mac65]. These explications suffer from basic semantical difficulties
[Kim71] [Pea00, pp. 249-256, 313-316], and they do not yield effective
procedures for computing probabilities of causes. This chapter defines
probabilities of causes in a language of counterfactuals that is based on a simple
model-theoretic semantics (to be formulated in Section 7.2).

[RG89] gave a counterfactual definition for the probability of necessary
causation, taking counterfactuals as primitives and assuming that one is in
possession of a consistent joint probability function on both ordinary and
counterfactual events. [Pea99] gave definitions for the probabilities of necessary
or sufficient causation (or both) based on structural model semantics, which
defines counterfactuals as quantities derived from modifiable sets of functions
[GP97, GP98, Hal98, Pea00]. The structural model semantics, as we shall see in
Section 7.2, leads to effective procedures for computing probabilities of
counterfactual expressions from a given causal theory [BP94, BP95]. Additionally,
this semantics can be characterized by a complete set of axioms [GP98, Hal98],
which we will use as inference rules in our analysis.

The  central  aim  of  this  chapter  is  to  estimate  probabilities  of  causation  from 
frequency  data,  as  obtained  in  experimental  and  observational  statistical  studies. 
In  general,  such  probabilities  are  non-identifiable,  that  is,  non-estimable  from 
frequency  data  alone.  One  factor  that  hinders  identifiability  is  confounding  - 
the  cause  and  the  effect  may  both  be  influenced  by  a  third  factor.  Moreover, 
even  in  the  absence  of  confounding,  probabilities  of  causation  are  sensitive  to  the 
data-generating  process,  namely,  the  functional  relationships  that  connect  causes 
and  effects  [RG89,  BP94].  Nonetheless,  useful  information  in  the  form  of  bounds 
on  the  probabilities  of  causation  can  be  extracted  from  empirical  data  without 
actually  knowing  the  data-generating  process.  These  bounds  improve  when  data 
from  observational  and  experimental  studies  are  combined.  Additionally,  under 
certain  assumptions  about  the  data-generating  process  (such  as  exogeneity  and 
monotonicity),  the  bounds  may  collapse  to  point  estimates,  which  means  that 
the  probabilities  of  causation  are  identifiable  -  they  can  be  expressed  in  terms  of 
probabilities  of  observed  quantities.  These  estimates  will  be  recognized  as  familiar 
expressions  that  often  appear  in  the  literature  as  measures  of  attribution.  Our 
analysis  thus  explicates  the  assumptions  about  the  data-generating  process  that 
must  be  ascertained  before  those  measures  can  legitimately  be  interpreted  as 
probabilities  of  causation. 

The analysis of this chapter leans heavily on results reported in [Pea99] [Pea00,
pp. 283-308]. Pearl derived bounds and identification conditions under certain
assumptions of exogeneity and monotonicity, and this chapter improves on Pearl’s
results by narrowing his bounds and weakening his assumptions. In particular,
we show that for most of Pearl’s results, the assumption of strong exogeneity can
be replaced by weak exogeneity (to be defined in Section 7.4.3). Additionally,
we show that the point estimates that Pearl obtained under the assumption of
monotonicity (Definition 19) constitute valid lower bounds when monotonicity
is not assumed. Finally, we prove that the bounds derived by Pearl, as well as
those provided in this chapter, are sharp, that is, they cannot be improved
without strengthening the assumptions.

The rest of the chapter is organized as follows. Section 7.2 reviews the
structural model semantics of actions, counterfactuals, and probability of
counterfactuals. In Section 7.3 we present formal definitions for the
probabilities of causation and briefly discuss their applicability in
epidemiology, artificial intelligence, and legal reasoning. In Section 7.4 we
systematically investigate the maximal information (about the probabilities of
causation) that can be obtained under various assumptions and from various types
of data. Section 7.5 illustrates, by example, how the results presented in this
chapter can be applied to resolve issues of attribution in legal settings.
Section 7.6 illustrates the use of our results in personal decision making.
Section 7.7 concludes the chapter.

7.2  Structural  Model  Semantics 

In  Chapters  1-5,  we  assumed  probabilistic  relations  between  variables  in  the 
model.  In  this  chapter,  we  assume  deterministic,  functional  relations  between 
variables,  and  the  causal  model  will  be  called  functional,  which,  in  addition  to 
interventions,  supports  counterfactual  readings.  This  section  presents  a  brief 
summary  of  the  structural-equation  semantics  of  counterfactuals  as  defined  in 
[BP95,  GP97,  GP98,  Hal98].  Related  approaches  have  been  proposed  in  [SR66] 
(see footnote 5) and [Rob86]. For a detailed exposition of the structural account
and its applications, see [Pea00].

Structural models are generalizations of the structural equations used in
engineering, biology, economics and social science.¹ World knowledge is represented
as a collection of stable and autonomous relationships called “mechanisms,” each
represented as a function, and changes due to interventions or hypothetical
eventualities are treated as local modifications of these functions.

A  causal  model  is  a  mathematical  object  that  assigns  truth  values  to  sentences 
involving  causal  relationships,  actions,  and  counterfactuals.  We  will  first  define 
functional  causal  models,  then  discuss  how  causal  sentences  are  evaluated  in  such 
models. We will restrict our discussion to recursive (or feedback-free) models;
extensions to non-recursive models can be found in [GP97, GP98, Hal98].

Definition 6 (Functional causal model)

A functional causal model is a triple

$$M = \langle U, V, F \rangle$$

where

(i) $U$ is a set of variables, called exogenous. (These variables will represent
background conditions, that is, variables whose values are determined outside
the model.)

(ii) $V$ is an ordered set $\{V_1, V_2, \ldots, V_n\}$ of variables, called
endogenous. (These represent variables that are determined in the model, namely,
by variables in $U \cup V$.)

(iii) $F$ is a set of functions $\{f_1, f_2, \ldots, f_n\}$ where each $f_i$ is a
mapping from $U \times (V_1 \times \ldots \times V_{i-1})$ to $V_i$. In other
words, each $f_i$ tells us the value of $V_i$ given the values of $U$ and all
predecessors of $V_i$. Symbolically, the set of equations $F$ can be represented
by writing²

$$v_i = f_i(pa_i, u_i), \quad i = 1, \ldots, n$$

where $pa_i$ is any realization of the unique minimal set of variables $PA_i$ in
$V$ (connoting parents) sufficient for representing $f_i$.³ Likewise,
$U_i \subseteq U$ stands for the unique minimal set of variables in $U$ that is
sufficient for representing $f_i$.

¹ Similar models, called “neuron diagrams” [Lew86, Hal02], are used informally by
philosophers to illustrate chains of causal processes.


Every functional causal model $M$ can be associated with a directed graph,
$G(M)$, in which each node corresponds to a variable in $V$ and the directed edges
point from members of $PA_i$ toward $V_i$ (by convention, the exogenous variables
are usually not shown explicitly in the graph). We call such a graph the causal
graph associated with $M$. This graph merely identifies the endogenous variables
$PA_i$ that have direct influence on each $V_i$, but it does not specify the
functional form of $f_i$.

Basic to our analysis are sentences involving actions or external interventions,
such as “p will be true if we do q”, where q is any elementary proposition. To
evaluate such sentences we need the notion of “submodel.”

Definition 7 (Submodel)

Let $M$ be a functional causal model, $X$ be a set of variables in $V$, and $x$ be
a particular assignment of values to the variables in $X$. A submodel $M_x$ of $M$
is the functional causal model

$$M_x = \langle U, V, F_x \rangle$$

where

$$F_x = \{f_i : V_i \notin X\} \cup \{X = x\}. \tag{7.1}$$

In words, $F_x$ is formed by deleting from $F$ all functions $f_i$ corresponding to
members of set $X$ and replacing them with the set of constant functions $X = x$.

² We use capital letters (e.g., X, Y) as names of variables and sets of variables,
and lower-case letters (e.g., x, y) for specific values (called realizations) of
the corresponding variables.

³ A set of variables $X$ is sufficient for representing a given function
$y = f(x, z)$ if $f$ is trivial in $Z$, that is, if for every $x, z, z'$ we have
$f(x, z) = f(x, z')$.

Submodels represent the effect of actions and hypothetical changes, including
those dictated by counterfactual antecedents. If we interpret each function $f_i$
in $F$ as an independent physical mechanism and define the action $do(X = x)$ as
the minimal change in $M$ required to make $X = x$ hold true under any $u$, then
$M_x$ represents the model that results from such a minimal change, since it
differs from $M$ by only those mechanisms that directly determine the variables in
$X$. The transformation from $M$ to $M_x$ modifies the algebraic content of $F$,
which is the reason for the name modifiable structural equations used in [GP98].⁴

Definition 8 (Effect of action)

Let $M$ be a functional causal model, $X$ be a set of variables in $V$, and $x$ be
a particular realization of $X$. The effect of action $do(X = x)$ on $M$ is given
by the submodel $M_x$.

Definition 9 (Potential response)

Let $Y$ be a variable in $V$, let $X$ be a subset of $V$, and let $u$ be a
particular value of $U$. The potential response of $Y$ to action $do(X = x)$ in
situation $u$, denoted $Y_x(u)$, is the (unique) solution for $Y$ of the set of
equations $F_x$.

We will confine our attention to actions in the form of $do(X = x)$. Conditional
actions, of the form “$do(X = x)$ if $Z = z$”, can be formalized using the
replacement of equations by functions of $Z$, rather than by constants [Pea94].
We will not consider disjunctive actions, of the form “$do(X = x$ or $X = x')$”,
since these complicate the probabilistic treatment of counterfactuals.

Definition 10 (Counterfactual)

Let $Y$ be a variable in $V$, and let $X$ be a subset of $V$. The counterfactual
expression “The value that $Y$ would have obtained, had $X$ been $x$” is
interpreted as denoting the potential response $Y_x(u)$.
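
To make Definitions 6 to 10 concrete, here is a minimal sketch of a functional
causal model, its submodels, and the potential response. The two-variable binary
model, the dictionary encoding of $F$, and the function names are assumptions
made for the example.

```python
# Hypothetical recursive model with U = (u1, u2) and V = (X, Y), all binary.
# Each mechanism maps (values of earlier endogenous variables, u) to a value.
F = {
    "X": lambda v, u: u[0],
    "Y": lambda v, u: v["X"] ^ u[1],   # Y = X xor u2
}
ORDER = ["X", "Y"]  # ordering of V consistent with the recursion

def solve(F, u, do=None):
    """Solve the equations of the submodel M_x in situation u.

    `do` maps intervened variables to constants; do=None solves M itself.
    """
    do = do or {}
    v = {}
    for name in ORDER:
        # A mechanism is replaced by a constant exactly when its variable is in X.
        v[name] = do[name] if name in do else F[name](v, u)
    return v

def potential_response(F, y_name, do, u):
    """Y_x(u): the value of Y in the submodel M_x under background condition u."""
    return solve(F, u, do)[y_name]

u = (1, 0)
actual = solve(F, u)                                      # values in M itself
counterfactual_y = potential_response(F, "Y", {"X": 0}, u)  # Y had X been 0
```

In situation u = (1, 0), X and Y are both 1 in the unmodified model, whereas
the value Y would have obtained had X been 0 is 0.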

Definition 10 thus interprets the counterfactual phrase “had $X$ been $x$” in
terms of a hypothetical external action that modifies the actual course of history
and enforces the condition “$X = x$” with minimal change of mechanisms. This is
a crucial step in the semantics of counterfactuals [BP94], as it permits $x$ to
differ from the actual value $X(u)$ of $X$ without creating logical contradiction;
it also suppresses abductive inferences (or backtracking) from the counterfactual
antecedent $X = x$.⁵

⁴ Structural modifications date back to [Mar50] and [Sim53]. An explicit
translation of interventions into “wiping out” equations from the model was first
proposed by [SW60] and later used in [Fis70], [Sob90], [SGS93], and [Pea95a]. A
similar notion of sub-model is introduced in [Fin85], though not specifically for
representing actions and counterfactuals.

It can easily be shown [GP97] that the counterfactual relationship just defined,
$Y_x(u)$, satisfies the following two properties:

Effectiveness:

For any two disjoint sets of variables, $Y$ and $W$, we have

$$Y_{yw}(u) = y. \tag{7.2}$$

In words, setting the variables in $W$ to $w$ has no effect on $Y$, once we set the
value of $Y$ to $y$.

Composition:

For any two disjoint sets of variables $X$ and $W$, and any set of variables $Y$,

$$W_x(u) = w \implies Y_{xw}(u) = Y_x(u). \tag{7.3}$$

In words, once we set $X$ to $x$, setting the variables in $W$ to the same values,
$w$, that they would attain (under $x$) should have no effect on $Y$. Furthermore,
effectiveness and composition are complete whenever $M$ is recursive (i.e., $G(M)$
is acyclic) [GP98, Hal98], that is, every property of counterfactuals that follows
from the structural model semantics can be derived by repeated application of
effectiveness and composition.

A corollary of composition is a property called consistency by [Rob87]:

$$(X(u) = x) \implies (Y_x(u) = Y(u)). \tag{7.4}$$

Consistency states that, if in a certain context $u$ we find variable $X$ at value
$x$, and we intervene and set $X$ to that same value, $x$, we should not expect any
change in the response variable $Y$. This property will be used in several
derivations of Sections 7.3 and 7.4.
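
Effectiveness, composition, and consistency can be verified by brute force on a
small model. The sketch below assumes a hypothetical three-variable binary
model (the mechanisms are made up) and checks the three properties for every
background condition $u$ and every choice of values. It illustrates the
properties on one model and is not a proof.

```python
from itertools import product

ORDER = ["X", "W", "Y"]
# Hypothetical binary mechanisms; u = (u1, u2, u3).
F = {
    "X": lambda v, u: u[0],
    "W": lambda v, u: v["X"] & u[1],
    "Y": lambda v, u: (v["X"] | v["W"]) ^ u[2],
}

def solve(u, do=None):
    """Solution of the submodel's equations in situation u (do=None gives M)."""
    do = do or {}
    v = {}
    for name in ORDER:
        v[name] = do[name] if name in do else F[name](v, u)
    return v

def check_properties():
    for u in product((0, 1), repeat=3):
        actual = solve(u)
        for x, w, y in product((0, 1), repeat=3):
            # Effectiveness (7.2): Y_{yw}(u) = y.
            if solve(u, {"Y": y, "W": w})["Y"] != y:
                return False
            # Composition (7.3): W_x(u) = w implies Y_{xw}(u) = Y_x(u).
            under_x = solve(u, {"X": x})
            if under_x["W"] == w and solve(u, {"X": x, "W": w})["Y"] != under_x["Y"]:
                return False
            # Consistency (7.4): X(u) = x implies Y_x(u) = Y(u).
            if actual["X"] == x and under_x["Y"] != actual["Y"]:
                return False
    return True

all_hold = check_properties()
```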

The  structural  formulation  generalizes  naturally  to  probabilistic  systems,  as 
is  seen  below. 

Definition 11 (Probabilistic functional causal model)

A probabilistic functional causal model is a pair

$$\langle M, P(u) \rangle$$

where $M$ is a functional causal model and $P(u)$ is a probability function
defined over the domain of $U$.

⁵ Simon and Rescher [SR66, p. 339] did not include this step in their account of
counterfactuals and noted that backward inferences triggered by the antecedents
can lead to ambiguous interpretations.
$P(u)$, together with the fact that each endogenous variable is a function of
$U$, defines a probability distribution over the endogenous variables. That is,
for every set of variables $Y \subseteq V$, we have

$$P(y) \triangleq P(Y = y) = \sum_{\{u \,\mid\, Y(u) = y\}} P(u). \tag{7.5}$$

The probability of counterfactual statements is defined in the same manner,
through the function $Y_x(u)$ induced by the submodel $M_x$. For example, the
causal effect of $x$ on $y$ is defined as:

$$P(Y_x = y) = \sum_{\{u \,\mid\, Y_x(u) = y\}} P(u). \tag{7.6}$$

Likewise, a probabilistic functional causal model defines a joint distribution
on counterfactual statements, i.e., $P(Y_x = y, Z_w = z)$ is defined for any sets
of variables $Y, X, Z, W$, not necessarily disjoint. In particular,
$P(Y_x = y, X = x')$ and $P(Y_x = y, Y_{x'} = y')$ are well defined for
$x \neq x'$, and are given by

$$P(Y_x = y, X = x') = \sum_{\{u \,\mid\, Y_x(u) = y \,\&\, X(u) = x'\}} P(u) \tag{7.7}$$

and

$$P(Y_x = y, Y_{x'} = y') = \sum_{\{u \,\mid\, Y_x(u) = y \,\&\, Y_{x'}(u) = y'\}} P(u). \tag{7.8}$$

When $x$ and $x'$ are incompatible, $Y_x$ and $Y_{x'}$ cannot be measured
simultaneously, and it may seem meaningless to attribute probability to the joint
statement “$Y$ would be $y$ if $X = x$ and $Y$ would be $y'$ if $X = x'$.” Such
concerns have been a source of recent objections to treating counterfactuals as
jointly distributed random variables [Daw97]. The definition of $Y_x$ and
$Y_{x'}$ in terms of two distinct submodels, driven by a standard probability
space over $U$, demonstrates that joint probabilities of counterfactuals have
solid mathematical and conceptual underpinning and, moreover, these probabilities
can be encoded rather parsimoniously using $P(u)$ and $F$.

In particular, the probabilities of causation analyzed in this chapter (see Eqs. (7.10)-(7.12)) require the evaluation of expressions of the form $P(Y_{x'} = y' \mid X = x, Y = y)$ with $x$ and $y$ incompatible with $x'$ and $y'$, respectively. Eq. (7.7) allows the evaluation of this quantity as follows:

$$P(Y_{x'} = y' \mid X = x, Y = y) = \frac{P(Y_{x'} = y', X = x, Y = y)}{P(X = x, Y = y)} = \sum_{u} P(Y_{x'}(u) = y')\, P(u \mid x, y) \tag{7.9}$$

In other words, we first update $P(u)$ to obtain $P(u \mid x, y)$, then we use the updated distribution $P(u \mid x, y)$ to compute the expectation of the propositional variable $Y_{x'}(u) = y'$.6
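The two-step reading of Eq. (7.9) can be illustrated by brute-force enumeration over $u$. The sketch below is not taken from the source: the toy model ($X = U_1$, $Y = X \vee U_2$), its probabilities, and all function names are hypothetical choices made only for illustration.

```python
from itertools import product

# Hypothetical toy model (an assumption, not from the text):
#   U = (U1, U2), independent binary variables
#   X = U1,  Y = X or U2
P_U1 = {0: 0.4, 1: 0.6}
P_U2 = {0: 0.7, 1: 0.3}


def p_u(u):
    """P(u) for u = (u1, u2)."""
    return P_U1[u[0]] * P_U2[u[1]]


def solve(u, do_x=None):
    """Return (X(u), Y(u)) in M, or in the submodel M_x when do_x is given."""
    x = u[0] if do_x is None else do_x
    y = int(x or u[1])
    return x, y


def counterfactual_prob(y_cf, x_cf, x_obs, y_obs):
    """P(Y_{x_cf} = y_cf | X = x_obs, Y = y_obs), computed as in Eq. (7.9)."""
    units = list(product((0, 1), repeat=2))
    # Update P(u) to P(u | x, y): keep only the u's that reproduce the evidence.
    evidence = [u for u in units if solve(u) == (x_obs, y_obs)]
    p_evidence = sum(p_u(u) for u in evidence)
    if p_evidence == 0:
        raise ValueError("evidence has probability zero")
    # Evaluate Y in the submodel M_{x_cf} and average over P(u | x, y).
    hits = sum(p_u(u) for u in evidence if solve(u, do_x=x_cf)[1] == y_cf)
    return hits / p_evidence


# Probability that Y would have been false had X been false,
# given that X and Y were both true (the form used later for PN).
pn = counterfactual_prob(y_cf=0, x_cf=0, x_obs=1, y_obs=1)
print(round(pn, 6))
```

In this toy model the evidence $X = 1, Y = 1$ fixes $U_1 = 1$ and leaves $U_2$ unchanged, so the result equals $P(U_2 = 0)$.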


7.3  Probabilities  of  Causation:  Definitions 

In this section, we present the definitions for the three aspects of causation as defined in [Pea99]. We use the counterfactual language and the structural model semantics introduced in Section 7.2. For notational simplicity, we limit the discussion to binary variables; extensions to multi-valued variables are straightforward (see [Pea00], page 286, footnote 5).

Definition 12 (Probability of necessity (PN))

Let $X$ and $Y$ be two binary variables in a functional causal model $M$, let $x$ and $y$ stand for the propositions $X = \text{true}$ and $Y = \text{true}$, respectively, and $x'$ and $y'$ for their complements. The probability of necessity is defined as the expression

$$PN = P(Y_{x'} = \text{false} \mid X = \text{true}, Y = \text{true}) = P(y'_{x'} \mid x, y) \tag{7.10}$$

In other words, PN stands for the probability that event $y$ would not have occurred in the absence of event $x$, $y'_{x'}$, given that $x$ and $y$ did in fact occur.7

This quantity has applications in epidemiology, legal reasoning, and artificial intelligence (AI). Epidemiologists have long been concerned with estimating the probability that a certain case of disease is attributable to a particular exposure, which is normally interpreted counterfactually as “the probability that disease would not have occurred in the absence of exposure, given that disease and exposure did in fact occur.” This counterfactual notion, which Robins and Greenland (1989) called the “probability of causation”, measures how necessary the cause is for the production of the effect. It is used frequently in lawsuits, where legal responsibility is at the center of contention (see Section 7.5).

6 In our deterministic model, $P(Y_{x'}(u) = y')$ takes on the values zero and one, but in models involving intrinsic nondeterminism (see Section 7.7), or memoryless stochastic fluctuations, $P(Y_{x'}(u) = y')$ expresses the residual uncertainty in $Y$, under the setting $X = x'$, in situation $U = u$. Eq. (7.9) then captures the uncertainty associated with the effect of action $do(X = x')$, conditioned on the pre-action evidence $X = x$ and $Y = y$.

7 Note a slight change in notation relative to that used in Section 7.2. Lower case letters (e.g., $x, y$) denoted arbitrary values of variables in Section 7.2, and now stand for propositions (or events). Note also the abbreviations $y_x$ for $Y_x = \text{true}$ and $y'_x$ for $Y_x = \text{false}$. Readers accustomed to writing “$A > B$” for the counterfactual “$B$ if it were $A$” can translate Eq. (7.10) to read $PN = P(x' > y' \mid x, y)$.

Definition 13 (Probability of sufficiency (PS))

$$PS = P(y_x \mid y', x') \tag{7.11}$$

PS measures the capacity of $x$ to produce $y$ and, since “production” implies a transition from the absence to the presence of $x$ and $y$, we condition the probability $P(y_x)$ on situations where $x$ and $y$ are both absent. Thus, mirroring the necessity of $x$ (as measured by PN), PS gives the probability that setting $x$ would produce $y$ in a situation where $x$ and $y$ are in fact absent.

PS finds applications in policy analysis, AI, and psychology. A policy maker may well be interested in the dangers that a certain exposure may present to the healthy population [KFG89]. Counterfactually, this notion is expressed as the “probability that a healthy unexposed individual would have gotten the disease had he/she been exposed.” In psychology, PS serves as the basis for Cheng’s (1997) causal power theory, which attempts to explain how humans judge causal strength among events. In AI, PS plays a major role in the generation of explanations [Pea00, pp. 221-223].

Definition 14 (Probability of necessity and sufficiency (PNS))

$$PNS = P(y_x, y'_{x'}) \tag{7.12}$$

PNS stands for the probability that $y$ would respond to $x$ both ways, and therefore measures both the sufficiency and necessity of $x$ to produce $y$.

As illustrated above, PS assesses the presence of an active causal process capable of producing the effect, while PN emphasizes the absence of alternative processes, not involving the cause in question, that are capable of explaining the effect. In legal settings, where the occurrence of the cause, $x$, and the effect, $y$, are fairly well established, PN is the measure that draws most attention, and the plaintiff must prove that $y$ would not have occurred but for $x$ [Rob97a]. Still, lack of sufficiency may weaken arguments based on PN [Goo93, Mic00].

Although none of these quantities is sufficient for determining the others, they are not entirely independent, as shown in the following lemma.

Lemma 20 The probabilities of causation satisfy the following relationship:

$$PNS = P(x, y)\,PN + P(x', y')\,PS \tag{7.13}$$

Proof of Lemma 20

Using the consistency condition of Eq. (7.4),

$$x \Rightarrow (y_x = y), \qquad x' \Rightarrow (y_{x'} = y), \tag{7.14}$$

we can write

$$\begin{aligned}
y_x \wedge y'_{x'} &= (y_x \wedge y'_{x'}) \wedge (x \vee x') \\
&= (y_x \wedge x \wedge y'_{x'}) \vee (y_x \wedge y'_{x'} \wedge x') \\
&= (y \wedge x \wedge y'_{x'}) \vee (y_x \wedge y' \wedge x')
\end{aligned}$$

Taking probabilities on both sides, and using the disjointness of $x$ and $x'$, we obtain:

$$\begin{aligned}
P(y_x, y'_{x'}) &= P(y'_{x'}, x, y) + P(y_x, x', y') \\
&= P(y'_{x'} \mid x, y)\,P(x, y) + P(y_x \mid x', y')\,P(x', y')
\end{aligned}$$

which proves Lemma 20. □
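Lemma 20 can also be checked numerically. The following sketch is only an illustration: it draws arbitrary joint distributions over $(Y_x, Y_{x'}, X)$, recovers $Y$ through the consistency condition (7.14), and compares both sides of Eq. (7.13). The helper names and the random sampling scheme are assumptions of this example.

```python
import random
from itertools import product


def random_joint(rng):
    """Random joint distribution over (Y_x, Y_x', X), each coded as 0 or 1."""
    cells = list(product((0, 1), repeat=3))
    weights = [rng.random() + 0.01 for _ in cells]  # keep every cell positive
    total = sum(weights)
    return {cell: w / total for cell, w in zip(cells, weights)}


def lemma20_sides(joint):
    """Return (PNS, P(x,y)*PN + P(x',y')*PS) for the given joint."""
    def prob(pred):
        return sum(p for cell, p in joint.items() if pred(*cell))

    # Consistency (7.14): Y = Y_x when X = 1 and Y = Y_x' when X = 0.
    p_x_y = prob(lambda yx, yxp, x: x == 1 and yx == 1)      # P(x, y)
    p_nx_ny = prob(lambda yx, yxp, x: x == 0 and yxp == 0)   # P(x', y')
    pns = prob(lambda yx, yxp, x: yx == 1 and yxp == 0)      # Eq. (7.12)
    pn = prob(lambda yx, yxp, x: x == 1 and yx == 1 and yxp == 0) / p_x_y
    ps = prob(lambda yx, yxp, x: x == 0 and yxp == 0 and yx == 1) / p_nx_ny
    return pns, p_x_y * pn + p_nx_ny * ps


rng = random.Random(0)
for _ in range(1000):
    lhs, rhs = lemma20_sides(random_joint(rng))
    assert abs(lhs - rhs) < 1e-12
```

The check passes for every sampled distribution because the identity depends only on consistency, not on any assumption about the model.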


Definition 15 (Identifiability)

Let $Q(M)$ be any quantity defined on a functional causal model $M$. $Q$ is identifiable in a class $\mathbf{M}$ of models iff any two models $M_1$ and $M_2$ from $\mathbf{M}$ that satisfy $P_{M_1}(v) = P_{M_2}(v)$ also satisfy $Q(M_1) = Q(M_2)$. In other words, $Q$ is identifiable if it can be determined uniquely from the probability distribution $P(v)$ of the endogenous variables $V$.

The class $\mathbf{M}$ that we will consider when discussing identifiability will be determined by assumptions that one is willing to make about the model under study. For example, if our assumptions consist of the structure of a causal graph $G_0$, $\mathbf{M}$ will consist of all models $M$ for which $G(M) = G_0$. If, in addition to $G_0$, we are also willing to make assumptions about the functional form of some mechanisms in $M$, $\mathbf{M}$ will consist of all models $M$ that incorporate those mechanisms, and so on.

Since all the causal measures defined above invoke conditionalization on $y$, and since $y$ is presumed affected by $x$, the antecedent of the counterfactual $y_x$, we know that none of these quantities is identifiable from knowledge of the structure $G(M)$ and the data $P(v)$ alone, even under condition of no confounding. However, useful information in the form of bounds may be derived for these quantities from $P(v)$, especially when knowledge about the causal effects $P(y_x)$ and $P(y_{x'})$ is also available.8 Moreover, under some general assumptions about the data-generating process, these quantities may even be identified.

To formulate precisely what it means to identify a counterfactual quantity from various types of data, we now generalize Definition 15 to capture the notion of “identification from experiments.” By experiment we mean a prescribed modification of the underlying functional causal model, together with the probability distribution that the modified model induces on the variables observed in the experiment.

Definition 16 (Identifiability from experiments)

Let $Q(M)$ be any quantity defined on a functional causal model $M$, let $M_{exp}$ be a modification of $M$ induced by some experiment, $exp$, and let $Y$ be a set of variables observed under $exp$. We say that $Q$ is identifiable from experiment $exp$ in a class $\mathbf{M}$ of models iff any two models $M_1$ and $M_2$ from $\mathbf{M}$ that satisfy $P_{M_1^{exp}}(y) = P_{M_2^{exp}}(y)$ also satisfy $Q(M_1) = Q(M_2)$. In other words, $Q$ is identifiable from $exp$ if it can be determined uniquely from the probability distribution that the observed variables $Y$ attain under the experimental conditions created by $exp$.

In the sequel, we will consider standard controlled experiments, in which the values of the control variable $X$ are assigned at random. The outcomes of such experiments are the causal effects probabilities, $P(y_x)$ and $P(y_{x'})$, which are also induced by the submodels $M_x$ and $M_{x'}$, respectively. However, Definition 16 is applicable to a much broader class of experimental designs, corresponding to both deletion and replacement of the model equations. Note that standard identifiability (Definition 15) is a special case of identifiability from experiments, where $Y = V$ and $M_{exp} = M$.


7.4  Bounds  and  Conditions  of  Identification 

In this section we estimate the three probabilities of causation defined in Section 7.3 when given experimental or nonexperimental data (or both) and additional assumptions about the data-generating process. We will assume that experimental data will be summarized in the form of the causal effects $P(y_x)$ and $P(y_{x'})$, and nonexperimental data will be summarized in the form of the joint probability function: $P_{XY} = \{P(x, y), P(x', y), P(x, y'), P(x', y')\}$.9

8 The causal effects $P(y_x)$ and $P(y_{x'})$ can be estimated reliably from controlled experimental studies, and from certain observational (i.e., nonexperimental) studies (see Chapter 5).

7.4.1  Linear  programming  formulation 

In principle, in order to compute the probability of any counterfactual sentence involving variables $X$ and $Y$ we need to specify a functional causal model, namely, the functional relation between $X$ and $Y$ and the probability distribution on $U$. However, since every such model induces a joint probability distribution on the four binary variables: $X$, $Y$, $Y_x$ and $Y_{x'}$, specifying the sixteen parameters of this distribution would suffice. Moreover, since $Y$ is a deterministic function of the other three variables, the problem is fully specified by the following set of eight parameters:


$$\begin{aligned}
p_{111} &= P(y_x, y_{x'}, x) = P(x, y, y_{x'}) \\
p_{110} &= P(y_x, y_{x'}, x') = P(x', y, y_x) \\
p_{101} &= P(y_x, y'_{x'}, x) = P(x, y, y'_{x'}) \\
p_{100} &= P(y_x, y'_{x'}, x') = P(x', y', y_x) \\
p_{011} &= P(y'_x, y_{x'}, x) = P(x, y', y_{x'}) \\
p_{010} &= P(y'_x, y_{x'}, x') = P(x', y, y'_x) \\
p_{001} &= P(y'_x, y'_{x'}, x) = P(x, y', y'_{x'}) \\
p_{000} &= P(y'_x, y'_{x'}, x') = P(x', y', y'_x)
\end{aligned}$$

where we have used the consistency condition Eq. (7.14). These parameters are constrained by the probabilistic constraints

$$\sum_{i=0}^{1} \sum_{j=0}^{1} \sum_{k=0}^{1} p_{ijk} = 1, \qquad p_{ijk} \ge 0 \ \text{ for } i, j, k \in \{0, 1\} \tag{7.15}$$

In addition, the nonexperimental probabilities $P_{XY}$ impose the constraints:

$$\begin{aligned}
p_{111} + p_{101} &= P(x, y) \\
p_{011} + p_{001} &= P(x, y') \\
p_{110} + p_{010} &= P(x', y)
\end{aligned} \tag{7.16}$$

9 For example, if $x$ represents a specific exposure and $y$ represents the outcome of a specific individual $I$, then $P_{XY}$ is estimated from sampled frequency counts in a population that is deemed representative of the relevant characteristics of $I$. The choice of an appropriate reference population is usually based on causal consideration (often suppressed), and involves matching the characteristics of $I$ against the causal model $\langle M, P(u) \rangle$ judged to govern the population.

and the causal effects, $P(y_x)$ and $P(y_{x'})$, impose the constraints:

$$\begin{aligned}
P(y_x) &= p_{111} + p_{110} + p_{101} + p_{100} \\
P(y_{x'}) &= p_{111} + p_{110} + p_{011} + p_{010}
\end{aligned} \tag{7.17}$$


The quantities we wish to bound are:

$$PNS = p_{101} + p_{100} \tag{7.18}$$

$$PN = p_{101} / P(x, y) \tag{7.19}$$

$$PS = p_{100} / P(x', y') \tag{7.20}$$

In the following sections we obtain bounds for these quantities by solving various linear programming problems. For example, given both experimental and nonexperimental data, the lower (and upper) bounds for PNS are obtained by minimizing (or maximizing, respectively) $p_{101} + p_{100}$ subject to the constraints (7.15), (7.16) and (7.17). The bounds obtained are guaranteed to be sharp because the optimization is global.

Optimizing the functions in (7.18)-(7.20), subject to equality constraints, defines a linear programming (LP) problem that lends itself to closed-form solution. [Bal95, Appendix B] describes a computer program that takes symbolic descriptions of LP problems and returns symbolic expressions for the desired bounds. The program works by systematically enumerating the vertices of the constraint polygon of the dual problem. The bounds reported in this chapter were produced (or tested) using Balke’s program, and will be stated here without proofs; their correctness can be verified by manually enumerating the vertices as described in [Bal95, Appendix B].
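For intuition, the linear program for PNS is small enough to solve by elimination. The sketch below is not Balke's program and does not enumerate dual vertices; it relies on a reduction worked out for this example only: after substituting (7.15)-(7.17), two parameters remain free, here taken to be $s = p_{100}$ and $t = p_{111}$, and nonnegativity confines each to an interval. Function and argument names are hypothetical.

```python
def pns_bounds_via_lp(p_x_y, p_x_ny, p_nx_y, p_nx_ny, p_y_do_x, p_y_do_nx):
    """Sharp bounds on PNS = p101 + p100 under constraints (7.15)-(7.17).

    Arguments, in order: P(x,y), P(x,y'), P(x',y), P(x',y'), P(y_x), P(y_x').
    """
    # From (7.16) and (7.17): p110 + p100 = a and p111 + p011 = b.
    a = p_y_do_x - p_x_y
    b = p_y_do_nx - p_nx_y
    # Free parameters s = p100 and t = p111 (a choice made for this sketch).
    # The remaining parameters are
    #   p110 = a - s,  p010 = P(x',y) - a + s,  p000 = P(x',y') - s,
    #   p101 = P(x,y) - t,  p011 = b - t,  p001 = P(x,y') - b + t,
    # and requiring each to be nonnegative gives the intervals below.
    s_lo, s_hi = max(0.0, a - p_nx_y), min(a, p_nx_ny)
    t_lo, t_hi = max(0.0, b - p_x_ny), min(b, p_x_y)
    if s_lo > s_hi + 1e-12 or t_lo > t_hi + 1e-12:
        raise ValueError("inputs violate constraints (7.15)-(7.17)")
    # PNS = (P(x,y) - t) + s is linear in (s, t), so its extremes
    # are attained at the interval endpoints.
    return p_x_y - t_hi + s_lo, p_x_y - t_lo + s_hi


# Hypothetical inputs chosen for illustration.
lower, upper = pns_bounds_via_lp(0.3, 0.2, 0.1, 0.4, 0.6, 0.2)
print(round(lower, 6), round(upper, 6))
```

Because $s$ and $t$ vary independently, the same two intervals also bound PN $= (P(x,y) - t)/P(x,y)$ and PS $= s/P(x',y')$.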


7.4.2 Bounds with no assumptions

7.4.2.1 Given nonexperimental data

Given $P_{XY}$, constraints (7.15) and (7.16) induce the following upper bound on PNS:

$$0 \le PNS \le P(x, y) + P(x', y'). \tag{7.21}$$

However, PN and PS are not constrained by $P_{XY}$.

These constraints also induce bounds on the causal effects $P(y_x)$ and $P(y_{x'})$:

$$\begin{aligned}
P(x, y) &\le P(y_x) \le 1 - P(x, y') \\
P(x', y) &\le P(y_{x'}) \le 1 - P(x', y')
\end{aligned} \tag{7.22}$$

7.4.2.2 Given causal effects

Given constraints (7.15) and (7.17), the bounds induced on PNS are:

$$\max[0, P(y_x) - P(y_{x'})] \le PNS \le \min[P(y_x), P(y'_{x'})] \tag{7.23}$$

with no constraints on PN and PS.


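7.4.2.3 Given both nonexperimental data and causal effects

Given the constraints (7.15), (7.16) and (7.17), the following bounds are induced on the three probabilities of causation:

$$\max \begin{Bmatrix} 0 \\ P(y_x) - P(y_{x'}) \\ P(y) - P(y_{x'}) \\ P(y_x) - P(y) \end{Bmatrix} \le PNS \le \min \begin{Bmatrix} P(y_x) \\ P(y'_{x'}) \\ P(x, y) + P(x', y') \\ P(y_x) - P(y_{x'}) + P(x, y') + P(x', y) \end{Bmatrix} \tag{7.24}$$

$$\max \begin{Bmatrix} 0 \\ \dfrac{P(y) - P(y_{x'})}{P(x, y)} \end{Bmatrix} \le PN \le \min \begin{Bmatrix} 1 \\ \dfrac{P(y'_{x'}) - P(x', y')}{P(x, y)} \end{Bmatrix} \tag{7.25}$$

$$\max \begin{Bmatrix} 0 \\ \dfrac{P(y_x) - P(y)}{P(x', y')} \end{Bmatrix} \le PS \le \min \begin{Bmatrix} 1 \\ \dfrac{P(y_x) - P(x, y)}{P(x', y')} \end{Bmatrix} \tag{7.26}$$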


Thus we see that some information about PN and PS can be extracted without making any assumptions about the data-generating process. Furthermore, combined data from both experimental and nonexperimental studies yield information that neither study alone can provide.
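The closed-form bounds (7.24)-(7.26) translate directly into a few lines of code. The sketch below is a plain transcription under the assumption that $P(x, y)$ and $P(x', y')$ are strictly positive; the function name, argument names, and the numbers in the usage line are hypothetical.

```python
def causation_bounds(p_x_y, p_x_ny, p_nx_y, p_nx_ny, p_y_do_x, p_y_do_nx):
    """Bounds (7.24)-(7.26) on PNS, PN and PS.

    Arguments, in order: P(x,y), P(x,y'), P(x',y), P(x',y'), P(y_x), P(y_x').
    Assumes P(x,y) > 0 and P(x',y') > 0.
    """
    p_y = p_x_y + p_nx_y            # P(y)
    p_ny_do_nx = 1.0 - p_y_do_nx    # P(y'_x')
    pns = (
        max(0.0, p_y_do_x - p_y_do_nx, p_y - p_y_do_nx, p_y_do_x - p_y),
        min(p_y_do_x, p_ny_do_nx, p_x_y + p_nx_ny,
            p_y_do_x - p_y_do_nx + p_x_ny + p_nx_y),
    )
    pn = (
        max(0.0, (p_y - p_y_do_nx) / p_x_y),
        min(1.0, (p_ny_do_nx - p_nx_ny) / p_x_y),
    )
    ps = (
        max(0.0, (p_y_do_x - p_y) / p_nx_ny),
        min(1.0, (p_y_do_x - p_x_y) / p_nx_ny),
    )
    return {"PNS": pns, "PN": pn, "PS": ps}


# Hypothetical data: P_XY = (0.3, 0.2, 0.1, 0.4), P(y_x) = 0.6, P(y_x') = 0.2.
bounds = causation_bounds(0.3, 0.2, 0.1, 0.4, 0.6, 0.2)
for name, (lo, hi) in bounds.items():
    print(name, round(lo, 4), round(hi, 4))
```

With these inputs the observational data alone would leave PN and PS unconstrained, whereas adding the two causal effects narrows all three quantities.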


7.4.3 Bounds under exogeneity (no confounding)

Definition 17 (Exogeneity)

A variable $X$ is said to be exogenous for $Y$ in model $M$ iff

$$P(y_x) = P(y \mid x) \quad \text{and} \quad P(y_{x'}) = P(y \mid x'), \tag{7.27}$$

or, equivalently,

$$Y_x \perp\!\!\!\perp X \quad \text{and} \quad Y_{x'} \perp\!\!\!\perp X. \tag{7.28}$$

In words, the way $Y$ would potentially respond to experimental conditions $x$ or $x'$ is independent of the actual value of $X$.

Eq. (7.27) has been given a variety of (equivalent) definitions and interpretations. Epidemiologists refer to this condition as “no-confounding” [RG89], statisticians call it “as if randomized,” and [RR83] call it “weak ignorability.” A graphical criterion ensuring exogeneity is the absence of a common ancestor of $X$ and $Y$ in $G(M)$ (more precisely, a common ancestor that is connected to $Y$ through a path not containing $X$, including latent ancestors, which represent dependencies among variables in $U$). The classical econometric criterion for exogeneity (e.g., [Dhr70, p. 169]) states that $X$ be independent of the error term ($u$) in the equation for $Y$.10 We will use the term “exogeneity”, since it was under this term that the relations given in (7.27) first received their precise definition (by economists).

Combining Eq. (7.27) with the constraints of (7.15)-(7.17), the linear programming optimization (Section 7.4.1) yields the following results:

Theorem 23 Under condition of exogeneity, the three probabilities of causation are bounded as follows:

$$\max[0, P(y \mid x) - P(y \mid x')] \le PNS \le \min[P(y \mid x), P(y' \mid x')] \tag{7.29}$$

$$\frac{\max[0, P(y \mid x) - P(y \mid x')]}{P(y \mid x)} \le PN \le \frac{\min[P(y \mid x), P(y' \mid x')]}{P(y \mid x)} \tag{7.30}$$

$$\frac{\max[0, P(y \mid x) - P(y \mid x')]}{P(y' \mid x')} \le PS \le \frac{\min[P(y \mid x), P(y' \mid x')]}{P(y' \mid x')} \tag{7.31}$$

The bounds expressed in Eq. (7.30) were first derived by [RG89]; a more elaborate proof can be found in [FS99]. [Pea99] derived Eqs. (7.29)-(7.31) under a stronger condition of exogeneity (see Definition 18). We see that under the condition of no-confounding the lower bound for PN can be expressed as

$$PN \ge 1 - \frac{1}{P(y \mid x)/P(y \mid x')} = 1 - \frac{1}{RR} \tag{7.32}$$

where $RR = P(y \mid x)/P(y \mid x')$ is the risk ratio (also called relative risk) in epidemiology. Courts have often used the condition $RR > 2$ as a criterion for legal responsibility [BGG94]. Eq. (7.32) shows that this practice represents a conservative interpretation of the “more probable than not” standard (assuming no confounding); PN must indeed be higher than 0.5 if RR exceeds 2. [FS99] argue that, in general, epidemiological evidence may not be applicable as proof for specific causation because such evidence cannot account for all characteristics specific to the plaintiff. Freedman and Stark further imply that the appropriate way of interpreting the “more probable than not” criterion would be to consider the probability of causation in a restricted subpopulation, one that shares the plaintiff's characteristics. Taken to the extreme, such a restrictive interpretation would insist on characterizing the plaintiff in minute detail, and would reduce PN to zero or one when all relevant details are accounted for. We doubt that this interpretation underlies the intent of judicial standards. We believe that, by using the wording “more probable than not,” law makers have instructed us to ignore specific features for which data is not available, and to base our determination on the most specific features for which reliable data is available (see footnote 9).11 PN assures us that two obvious features of the plaintiff will not be ignored: the exposure, $x$, and the injury, $y$. In contrast, these two features are ignored in the causal effect measure $P(y_x)$, which is a quantity averaged over the entire population, including unexposed and uninjured.

10 This criterion has been the subject of relentless objections by modern econometricians [EHR83, Hen95, Imb97], but see [Ald93] and [Pea00, pp. 169-170; 245-247] for a reconciliatory perspective on this controversy.
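The exogeneity bounds of Theorem 23 and the risk-ratio form of the PN lower bound need only the two conditional probabilities $P(y \mid x)$ and $P(y \mid x')$. The sketch below is an illustrative transcription; the function names and the example values are assumptions, and both probabilities are taken to lie strictly between 0 and 1.

```python
def bounds_under_exogeneity(p_y_given_x, p_y_given_nx):
    """Bounds (7.29)-(7.31), assuming X is exogenous for Y.

    Requires 0 < P(y|x) and P(y|x') < 1 so the denominators are nonzero.
    """
    lo = max(0.0, p_y_given_x - p_y_given_nx)
    hi = min(p_y_given_x, 1.0 - p_y_given_nx)
    p_ny_given_nx = 1.0 - p_y_given_nx
    return {
        "PNS": (lo, hi),
        "PN": (lo / p_y_given_x, hi / p_y_given_x),
        "PS": (lo / p_ny_given_nx, hi / p_ny_given_nx),
    }


def pn_lower_bound_from_rr(rr):
    """Lower bound on PN as a function of the risk ratio RR > 0, Eq. (7.32)."""
    return max(0.0, 1.0 - 1.0 / rr)


# Hypothetical example: P(y|x) = 0.3 and P(y|x') = 0.1, so RR is about 3.
b = bounds_under_exogeneity(0.3, 0.1)
print(b["PN"], pn_lower_bound_from_rr(0.3 / 0.1))
```

At $RR = 2$ the lower bound equals 0.5, which is the threshold the text associates with the “more probable than not” standard.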


7.4.3.1 Bounds under strong exogeneity

The condition of exogeneity, as defined in Eq. (7.27), is testable by comparing experimental and nonexperimental data. A stronger version of exogeneity can be defined as the joint independence $\{Y_x, Y_{x'}\} \perp\!\!\!\perp X$, which was called “strong ignorability” by Rosenbaum and Rubin [RR83]. Though untestable, such joint independence is assumed to hold when we assert the absence of factors that simultaneously affect exposure and outcome.


Definition 18 (Strong Exogeneity)

A variable $X$ is said to be strongly exogenous for $Y$ in model $M$ iff $\{Y_x, Y_{x'}\} \perp\!\!\!\perp X$, that is,

$$\begin{aligned}
P(y_x, y_{x'} \mid x) &= P(y_x, y_{x'}) \\
P(y_x, y'_{x'} \mid x) &= P(y_x, y'_{x'}) \\
P(y'_x, y_{x'} \mid x) &= P(y'_x, y_{x'}) \\
P(y'_x, y'_{x'} \mid x) &= P(y'_x, y'_{x'})
\end{aligned} \tag{7.33}$$

The four conditions in (7.33) are sufficient to represent $\{Y_x, Y_{x'}\} \perp\!\!\!\perp X$, because for every event $E$ we have

$$P(E \mid x) = P(E) \implies P(E \mid x') = P(E). \tag{7.34}$$

Remarkably, the added constraints introduced by strong exogeneity do not alter the bounds of Eqs. (7.29)-(7.31). They do, however, strengthen Lemma 20:

Theorem 24 If strong exogeneity holds, the probabilities PN, PS, and PNS are constrained by the bounds of Eqs. (7.29)-(7.31), and, moreover, PN, PS, and PNS are related to each other as follows [Pea99]:

$$PN = \frac{PNS}{P(y \mid x)} \tag{7.35}$$

$$PS = \frac{PNS}{P(y' \mid x')} \tag{7.36}$$

11 Our results remain valid when we condition $P_{XY}$ on a set of covariates that characterize the specific case at hand.

7.4.4 Identifiability under monotonicity

Definition 19 (Monotonicity)

A variable $Y$ is said to be monotonic relative to variable $X$ in a functional causal model $M$ iff

$$y'_x \wedge y_{x'} = \text{false} \tag{7.37}$$

Monotonicity expresses the assumption that a change from $X = \text{false}$ to $X = \text{true}$ cannot, under any circumstance, make $Y$ change from true to false. In epidemiology, this assumption is often expressed as “no prevention,” that is, no individual in the population can be helped by exposure to the risk factor. [BP97] used this assumption to tighten bounds of treatment effects from studies involving non-compliance. Glymour [Gly98] and Cheng [Che97] resort to this assumption in using disjunctive or conjunctive relationships between causes and effects, excluding functions such as exclusive-or, or parity.

In the linear programming formulation of Section 7.4.1, monotonicity narrows the feasible space to the manifold:

$$\begin{aligned}
p_{011} &= 0 \\
p_{010} &= 0
\end{aligned} \tag{7.38}$$

7.4.4.1 Given nonexperimental data

Under the constraints (7.15), (7.16), and (7.38), we find the same bounds for PNS as the ones obtained under no assumptions (Eq. (7.21)). Moreover, there are still no constraints on PN and PS. Thus, with nonexperimental data alone, the monotonicity assumption does not provide new information.

However, the monotonicity assumption induces sharper bounds on the causal effects $P(y_x)$ and $P(y_{x'})$:

$$\begin{aligned}
P(y) &\le P(y_x) \le 1 - P(x, y') \\
P(x', y) &\le P(y_{x'}) \le P(y)
\end{aligned} \tag{7.39}$$

Compared with Eq. (7.22), the lower bound for $P(y_x)$ and the upper bound for $P(y_{x'})$ are tightened. The importance of Eq. (7.39) lies in providing a simple necessary test for the assumption of monotonicity. These inequalities are sharp, in the sense that every combination of experimental and non-experimental data that satisfies these inequalities can be generated from some functional causal model in which $Y$ is monotonic in $X$.
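The necessary test provided by Eq. (7.39) amounts to four inequality checks. The sketch below is a minimal illustration; the function name, the tolerance parameter, and the example numbers are assumptions and do not come from the source.

```python
def consistent_with_monotonicity(p_x_y, p_x_ny, p_nx_y,
                                 p_y_do_x, p_y_do_nx, tol=1e-12):
    """Check the inequalities of Eq. (7.39).

    Arguments, in order: P(x,y), P(x,y'), P(x',y), P(y_x), P(y_x').
    Returns False when the data refute monotonicity (or when the experimental
    and nonexperimental data come from different populations).
    """
    p_y = p_x_y + p_nx_y  # P(y)
    ok_do_x = p_y - tol <= p_y_do_x <= 1.0 - p_x_ny + tol
    ok_do_nx = p_nx_y - tol <= p_y_do_nx <= p_y + tol
    return ok_do_x and ok_do_nx


# Hypothetical data with P(y) = 0.4.
print(consistent_with_monotonicity(0.3, 0.2, 0.1, 0.6, 0.2))  # passes the test
print(consistent_with_monotonicity(0.3, 0.2, 0.1, 0.6, 0.5))  # P(y_x') > P(y)
```

Passing the test does not establish monotonicity; it only means the data do not rule it out.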

That the commonly made assumption of “no-prevention” is not entirely exempt from empirical scrutiny should come as a relief to many epidemiologists. Alternatively, if the no-prevention assumption is theoretically unassailable, the inequalities of Eq. (7.39) can be used for testing the compatibility of the experimental and non-experimental data, namely, whether subjects used in clinical trials were sampled from the same target population, characterized by the joint distribution $P_{XY}$.

7.4.4.2 Given causal effects

Constraints (7.15), (7.17), and (7.38) induce no constraints on PN and PS, while the value of PNS is fully determined:

$$PNS = P(y_x, y'_{x'}) = P(y_x) - P(y_{x'})$$

That is, under the assumption of monotonicity, PNS can be determined by experimental data alone, despite the fact that the joint event $y_x \wedge y'_{x'}$ can never be observed.

7.4.4.3 Given both nonexperimental data and causal effects

Under the constraints (7.15)-(7.17) and (7.38), the values of PN, PS, and PNS are all determined precisely.

Theorem 25 If $Y$ is monotonic relative to $X$, then PNS, PN, and PS are given by

$$PNS = P(y_x, y'_{x'}) = P(y_x) - P(y_{x'}) \tag{7.40}$$

$$PN = P(y'_{x'} \mid x, y) = \frac{P(y) - P(y_{x'})}{P(x, y)} \tag{7.41}$$

$$PS = P(y_x \mid x', y') = \frac{P(y_x) - P(y)}{P(x', y')} \tag{7.42}$$

Corollary 2 If $Y$ is monotonic relative to $X$, then PNS, PN, and PS are identifiable whenever the causal effects $P(y_x)$ and $P(y_{x'})$ are identifiable.

Eqs. (7.40)-(7.42) are applicable to situations where, in addition to observational probabilities, we also have information about the causal effects $P(y_x)$ and $P(y_{x'})$. Such information may be obtained either directly, through separate experimental studies, or indirectly, from observational studies in which certain identifying assumptions are deemed plausible (e.g., assumptions that permit identification through adjustment of covariates). Note that the identification of PN requires only $P(y_{x'})$ while that of PS requires $P(y_x)$. In practice, however, any method that yields the former also yields the latter.
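Under monotonicity, Eqs. (7.40)-(7.42) give point values rather than bounds. The sketch below evaluates them for hypothetical inputs and confirms that the results satisfy Lemma 20; the function name and numbers are assumptions of this example, and $P(x, y)$ and $P(x', y')$ are assumed positive.

```python
def causation_under_monotonicity(p_x_y, p_nx_y, p_nx_ny, p_y_do_x, p_y_do_nx):
    """Return (PNS, PN, PS) from Eqs. (7.40)-(7.42).

    Arguments, in order: P(x,y), P(x',y), P(x',y'), P(y_x), P(y_x').
    Valid only if Y is monotonic relative to X.
    """
    p_y = p_x_y + p_nx_y                  # P(y)
    pns = p_y_do_x - p_y_do_nx            # Eq. (7.40)
    pn = (p_y - p_y_do_nx) / p_x_y        # Eq. (7.41), uses only P(y_x')
    ps = (p_y_do_x - p_y) / p_nx_ny       # Eq. (7.42), uses only P(y_x)
    return pns, pn, ps


# Hypothetical data: P(x,y) = 0.3, P(x',y) = 0.1, P(x',y') = 0.4,
# P(y_x) = 0.6, P(y_x') = 0.2.
pns, pn, ps = causation_under_monotonicity(0.3, 0.1, 0.4, 0.6, 0.2)
# Lemma 20: PNS = P(x,y) * PN + P(x',y') * PS
assert abs(pns - (0.3 * pn + 0.4 * ps)) < 1e-12
print(round(pns, 4), round(pn, 4), round(ps, 4))
```

For these inputs the values coincide with the lower bounds of (7.24)-(7.26), since monotonicity forces $p_{011} = p_{010} = 0$.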

One common class of models which permits the identification of $P(y_x)$ is called Markovian.

Definition 20 (Markovian models)

A functional causal model $M$ is said to be Markovian if the graph $G(M)$ associated with $M$ is acyclic, and if the exogenous factors $U_i$ are mutually independent. A model is semi-Markovian iff $G(M)$ is acyclic and the exogenous variables are not necessarily independent. A functional causal model is said to be positive-Markovian if it is Markovian and $P(v) > 0$ for every $v$.

It is shown in [Pea93, Pea95a] that for every two variables, $X$ and $Y$, in a positive-Markovian model $M$, the causal effects $P(y_x)$ and $P(y_{x'})$ are identifiable and are given by

$$\begin{aligned}
P(y_x) &= \sum_{pa_X} P(y \mid pa_X, x)\, P(pa_X) \\
P(y_{x'}) &= \sum_{pa_X} P(y \mid pa_X, x')\, P(pa_X)
\end{aligned} \tag{7.43}$$

where $pa_X$ are (values of) the parents of $X$ in the causal graph associated with $M$ (see also [SGS93], [Rob86], and [Pea00, p. 73]). Thus, we can combine Eq. (7.43) with Theorem 25 and obtain a concrete condition for the identification of the probability of causation.
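Eq. (7.43) can be evaluated directly from a joint table over the parents of $X$, $X$, and $Y$. The sketch below assumes a hypothetical three-variable model in which a single binary variable $Z$ is the only parent of $X$ and also affects $Y$; the table values and all names are illustrative assumptions.

```python
def causal_effect_by_parent_adjustment(joint, x):
    """P(Y_x = 1) = sum over pa of P(y | pa, x) * P(pa), as in Eq. (7.43).

    joint maps (pa, x, y) -> probability, where pa is any hashable value
    standing for one configuration of the parents of X.
    """
    parents = {pa for pa, _, _ in joint}
    effect = 0.0
    for pa in parents:
        p_pa = sum(p for (a, _, _), p in joint.items() if a == pa)
        p_pa_x = sum(p for (a, b, _), p in joint.items() if a == pa and b == x)
        p_pa_x_y = joint.get((pa, x, 1), 0.0)
        # Positivity (P(v) > 0) guarantees p_pa_x > 0.
        effect += (p_pa_x_y / p_pa_x) * p_pa
    return effect


# Hypothetical positive-Markovian model: Z -> X, Z -> Y, X -> Y.
p_z = {0: 0.5, 1: 0.5}
p_x1_given_z = {0: 0.2, 1: 0.8}
p_y1_given_xz = {(0, 0): 0.1, (1, 0): 0.5, (0, 1): 0.4, (1, 1): 0.9}  # key (x, z)

joint = {}
for z in (0, 1):
    for x in (0, 1):
        for y in (0, 1):
            px = p_x1_given_z[z] if x else 1.0 - p_x1_given_z[z]
            py = p_y1_given_xz[(x, z)] if y else 1.0 - p_y1_given_xz[(x, z)]
            joint[(z, x, y)] = p_z[z] * px * py

p_y_do_x = causal_effect_by_parent_adjustment(joint, 1)
p_y_do_nx = causal_effect_by_parent_adjustment(joint, 0)
print(round(p_y_do_x, 4), round(p_y_do_nx, 4))
```

The two outputs are the quantities $P(y_x)$ and $P(y_{x'})$ that Theorem 25 takes as inputs.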

Corollary 3 If in a positive-Markovian model $M$, the function $Y_x(u)$ is monotonic, then the probabilities of causation PNS, PS and PN are identifiable and are given by Eqs. (7.40)-(7.42), with $P(y_x)$ given in Eq. (7.43). If monotonicity cannot be ascertained, then PNS, PN and PS are bounded by Eqs. (7.24)-(7.26), with $P(y_x)$ given in Eq. (7.43).

Broader identification conditions can be obtained through the use of the criteria for identifying $P_x(y)$ in Chapter 5. In particular, Theorem 17 leads to the following corollary:

Corollary 4 Let $GP$ be the class of semi-Markovian models that satisfy the graphical criterion of Theorem 17. If $Y_x(u)$ is monotonic, then the probabilities of causation PNS, PS and PN are identifiable in $GP$ and are given by Eqs. (7.40)-(7.42), with $P(y_x)$ determined by the topology of $G(M)$ through Theorem 17.


7.4.5 Identifiability under monotonicity and exogeneity

Under the assumption of monotonicity, if we further assume exogeneity, then $P(y_x)$ and $P(y_{x'})$ are identified through Eq. (7.27), and from Theorem 25 we conclude that PNS, PN, and PS are all identifiable.


Theorem 26 (Identifiability under exogeneity and monotonicity)

If $X$ is exogenous and $Y$ is monotonic relative to $X$, then the probabilities PN, PS, and PNS are all identifiable, and are given by

$$PNS = P(y \mid x) - P(y \mid x') \tag{7.44}$$

$$PN = \frac{P(y) - P(y \mid x')}{P(x, y)} = \frac{P(y \mid x) - P(y \mid x')}{P(y \mid x)} \tag{7.45}$$

$$PS = \frac{P(y \mid x) - P(y)}{P(x', y')} = \frac{P(y \mid x) - P(y \mid x')}{P(y' \mid x')} \tag{7.46}$$


These  expressions  are  to  be  recognized  as  familiar  measures  of  attribution  that 
often  appear  in  the  literature.  The  r.h.s.  of  (7.44)  is  called  “risk-difference” 
in  epidemiology,  and  is  also  misnamed  “attributable  risk”  [HB87,  p.  87].  The 
probability  of  necessity,  PN,  is  given  by  the  excess-risk-ratio  (ERR) 


PN  - 


P(y\x )  ~  P(y\x')  _  1 

P(y\x)  RR 


(7.47) 


often  misnamed  as  the  attributable  fraction  [Sch82],  attributable-rate  percent  [HB87, 
p.  88],  attributed  fraction  for  the  exposed  [KWE96,  p.  38],  or  attributable  propor¬ 
tion  [Col97] .  The  reason  we  consider  these  labels  to  be  misnamed  is  that  ERR 
invokes  purely  statistical  relationships,  hence  it  cannot  in  itself  serve  to  mea¬ 
sure  attribution,  unless  fortified  with  some  causal  assumptions.  Exogeneity  and 
monotonicity  are  the  causal  assumptions  that  endow  ERR  with  attributional  in¬ 
terpretation,  and  these  assumptions  are  rarely  made  explicit  in  the  literature  on 
attribution. 
The expression for PS is likewise quite revealing,

$$\mathrm{PS} = \frac{P(y|x) - P(y|x')}{1 - P(y|x')} \tag{7.48}$$

as it coincides with what epidemiologists call the “relative difference” [She58],
which is used to measure the susceptibility of a population to a risk factor x. It
also coincides with what Cheng calls “causal power” [Che97], namely, the effect
of x on y after suppressing “all other causes of y.” See [Pea99] for additional
discussions of these expressions.
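
The three expressions of Theorem 26 can be evaluated directly from the two conditional probabilities P(y|x) and P(y|x'). The following is a minimal sketch, not part of the original analysis; the function name and the input values 0.30 and 0.10 are hypothetical and serve only as an illustration.

```python
def causation_probs_exog_monotone(p_y_given_x, p_y_given_xp):
    """Return (PNS, PN, PS) under exogeneity and monotonicity.

    Inputs are P(y|x) and P(y|x'). The formulas follow Eqs. (7.44),
    (7.47) and (7.48); they are meaningful only if both causal
    assumptions actually hold.
    """
    diff = p_y_given_x - p_y_given_xp       # risk difference, Eq. (7.44)
    if diff < 0:
        # Monotonicity plus exogeneity imply P(y|x) >= P(y|x').
        raise ValueError("data incompatible with exogeneity and monotonicity")
    if p_y_given_x == 0 or p_y_given_xp == 1:
        raise ValueError("PN or PS is undefined for these inputs")
    pns = diff
    pn = diff / p_y_given_x                 # excess-risk-ratio, Eq. (7.47)
    ps = diff / (1.0 - p_y_given_xp)        # relative difference, Eq. (7.48)
    return pns, pn, ps


# Hypothetical illustration: P(y|x) = 0.30, P(y|x') = 0.10.
pns, pn, ps = causation_probs_exog_monotone(0.30, 0.10)
print(f"PNS={pns:.3f}  PN={pn:.3f}  PS={ps:.3f}")
```

The guard on a negative risk difference reflects the fact that such data would contradict at least one of the two assumptions, in which case the formulas should not be used at all.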

To appreciate the difference between Eqs. (7.41) and (7.47) we can rewrite
Eq. (7.41) as

$$\mathrm{PN} = \frac{P(y|x)P(x) + P(y|x')P(x') - P(y_{x'})}{P(y|x)P(x)} = \frac{P(y|x) - P(y|x')}{P(y|x)} + \frac{P(y|x') - P(y_{x'})}{P(x,y)} \tag{7.49}$$

The first term on the r.h.s. of (7.49) is the familiar ERR as in (7.47), and represents
the value of PN under exogeneity. The second term represents the correction
needed to account for X’s non-exogeneity, i.e., $P(y_{x'}) \neq P(y|x')$. We will call the
r.h.s. of (7.49) the corrected excess-risk-ratio (CERR).
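
The decomposition in (7.49) can be checked numerically. The sketch below is an illustration only: the function names are hypothetical, and the input values are borrowed from the hypothetical data of Table 7.2 (Section 7.5), where P(y|x) = 2/1000, P(y|x') = 28/1000, P(x) = 1/2, and the experimental estimate is P(y_{x'}) = 0.014.

```python
def err(p_y_given_x, p_y_given_xp):
    """Excess-risk-ratio, the first term on the r.h.s. of (7.49)."""
    return 1.0 - p_y_given_xp / p_y_given_x


def cerr(p_y_given_x, p_y_given_xp, p_x, p_y_do_xp):
    """Corrected excess-risk-ratio: ERR plus the confounding correction.

    p_y_do_xp is the experimental quantity P(y_{x'}); the other three
    arguments are observational.
    """
    p_xy = p_y_given_x * p_x                        # P(x, y)
    correction = (p_y_given_xp - p_y_do_xp) / p_xy  # second term of (7.49)
    return err(p_y_given_x, p_y_given_xp) + correction


p_y_x, p_y_xp, p_x, p_y_do_xp = 0.002, 0.028, 0.5, 0.014

# Left-hand form of (7.49): [P(y) - P(y_{x'})] / P(x, y).
p_y = p_y_x * p_x + p_y_xp * (1.0 - p_x)
direct = (p_y - p_y_do_xp) / (p_y_x * p_x)

print(f"ERR  = {err(p_y_x, p_y_xp):.3f}")
print(f"CERR = {cerr(p_y_x, p_y_xp, p_x, p_y_do_xp):.3f}")
print(f"direct form = {direct:.3f}")
```

With these numbers the uncorrected ERR is negative and therefore meaningless as a probability, while the correction term restores a valid value; the two forms of (7.49) agree up to rounding.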

From  Eqs.  (7.44)-(7.46)  we  see  that  the  three  notions  of  causation  satisfy  the 
simple  relationships  given  by  Eqs.  (7.35)  and  (7.36)  which  we  obtained  under  the 
strong  exogeneity  condition.  In  fact,  we  have  the  following  theorem. 


Theorem 27 Monotonicity (7.37) and exogeneity (7.27) together imply strong
exogeneity (7.33).


Proof of Theorem 27:

From the monotonicity condition, we have

$$y_{x'} = y_{x'} \wedge (y_x \vee y'_x) = (y_{x'} \wedge y_x) \vee (y_{x'} \wedge y'_x) = y_{x'} \wedge y_x. \tag{7.50}$$

Thus we can write

$$P(y_{x'}) = P(y_x, y_{x'}), \tag{7.51}$$

and

$$P(y|x') = P(y_{x'}|x') = P(y_x, y_{x'}|x'), \tag{7.52}$$

where consistency condition (7.14) is used. The exogeneity condition (7.27) allows
us to equate (7.51) and (7.52), and we obtain

$$P(y_x, y_{x'}|x') = P(y_x, y_{x'}), \tag{7.53}$$
Table 7.1: PN (the probability of necessary causation) as a function of assumptions
and available data. ERR stands for the excess-risk-ratio $1 - P(y|x')/P(y|x)$
and CERR is given in Eq. (7.49). The non-entries (—) represent vacuous bounds,
that is, $0 \le \mathrm{PN} \le 1$.

    Assumptions                 |  Data Available
    Exogeneity   Monotonicity   |  Experimental   Nonexperimental   Combined
    +            +              |  ERR            ERR               ERR
    +            -              |  bounds         bounds            bounds
    -            +              |  —              —                 CERR
    -            -              |  —              —                 bounds

which implies the first of the four conditions in (7.33):

$$P(y_x, y_{x'}|x) = P(y_x, y_{x'}). \tag{7.54}$$

Combining Eq. (7.54) with

$$P(y_x) = P(y_x, y_{x'}) + P(y_x, y'_{x'}), \tag{7.55}$$

$$P(y|x) = P(y_x|x) = P(y_x, y_{x'}|x) + P(y_x, y'_{x'}|x), \tag{7.56}$$

and the exogeneity condition (7.27), we obtain the second equation in (7.33):

$$P(y_x, y'_{x'}|x) = P(y_x, y'_{x'}). \tag{7.57}$$

Both sides of the third equation in (7.33) are equal to zero by the monotonicity
condition, and the last equation in (7.33) follows because the four quantities sum
to 1 on both sides of the four equations. □
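
The role of monotonicity in Theorem 27 can be illustrated with a small numerical check. The sketch below is an illustration under stated assumptions, not part of the proof: it represents a model by a hypothetical joint distribution over response types (Y_x, Y_x') and the treatment X, reads exogeneity as P(y_x) = P(y|x) and P(y_x') = P(y|x'), reads monotonicity as P(y'_x, y_x') = 0, and reads strong exogeneity as independence of the pair (Y_x, Y_x') from X. The two example distributions are invented for the purpose.

```python
from itertools import product

X_VALUES = ("x", "x'")
TYPES = list(product((0, 1), repeat=2))   # response type = (Y_x, Y_x')


def p_x(joint, xv):
    return sum(joint.get((t, xv), 0.0) for t in TYPES)


def p_type(joint, t):
    return sum(joint.get((t, xv), 0.0) for xv in X_VALUES)


def is_monotonic(joint, tol=1e-9):
    # The event (Y_x = 0, Y_x' = 1) must have probability zero.
    return p_type(joint, (0, 1)) <= tol


def is_exogenous(joint, tol=1e-9):
    # By consistency, P(y|x) = P(y_x|x) and P(y|x') = P(y_x'|x').
    p_yx = p_type(joint, (1, 0)) + p_type(joint, (1, 1))
    p_yxp = p_type(joint, (0, 1)) + p_type(joint, (1, 1))
    p_y_given_x = (joint.get(((1, 0), "x"), 0.0)
                   + joint.get(((1, 1), "x"), 0.0)) / p_x(joint, "x")
    p_y_given_xp = (joint.get(((0, 1), "x'"), 0.0)
                    + joint.get(((1, 1), "x'"), 0.0)) / p_x(joint, "x'")
    return abs(p_yx - p_y_given_x) <= tol and abs(p_yxp - p_y_given_xp) <= tol


def is_strongly_exogenous(joint, tol=1e-9):
    # The joint response type must be independent of X.
    return all(
        abs(joint.get((t, xv), 0.0) - p_type(joint, t) * p_x(joint, xv)) <= tol
        for t in TYPES for xv in X_VALUES
    )


# Hypothetical non-monotonic model: exogenous but not strongly exogenous.
non_monotonic = {((1, 1), "x"): 0.25, ((0, 0), "x"): 0.25,
                 ((1, 0), "x'"): 0.25, ((0, 1), "x'"): 0.25}

# Hypothetical monotonic model with response types dependent on X:
# consistent with the theorem, it cannot be exogenous.
monotonic_dependent = {((1, 1), "x"): 0.25, ((0, 0), "x"): 0.25,
                       ((1, 0), "x'"): 0.25, ((0, 0), "x'"): 0.25}

for name, joint in [("non-monotonic", non_monotonic),
                    ("monotonic, dependent", monotonic_dependent)]:
    print(name, is_monotonic(joint), is_exogenous(joint),
          is_strongly_exogenous(joint))
```

The first example shows that exogeneity alone does not yield strong exogeneity once the type (y'_x, y_x') is allowed; the second shows the contrapositive of the theorem for a monotonic model.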

7.4.6  Summary  of  results 

We now summarize the results from Section 7.4 that should be of value to practicing
epidemiologists and policy makers. These results are shown in Table 7.1,
which lists the best estimate of PN under various assumptions and various types
of data: the stronger the assumptions, the more informative the estimates.

We see that the excess-risk-ratio (ERR), which epidemiologists commonly
identify with the probability of causation, is a valid measure of PN only when
two assumptions can be ascertained: exogeneity (i.e., no confounding) and monotonicity
(i.e., no prevention). When monotonicity does not hold, ERR provides
merely a lower bound for PN, as shown in Eq. (7.30). (The upper bound is
usually unity.) In the presence of confounding, ERR must be corrected by the
additive term $[P(y|x') - P(y_{x'})]/P(x,y)$, as stated in (7.49). In other words,
when confounding bias (of the causal effect) is positive, PN is higher than ERR
by the amount of this additive term. Clearly, owing to the division by $P(x,y)$,
the PN bias can be many times higher than the causal effect bias $P(y|x') - P(y_{x'})$.
However, confounding results only from association between exposure and other
factors that affect the outcome; one need not be concerned with associations
between such factors and susceptibility to exposure, as is often assumed in the
literature [KFG89, Gly98].

The last two rows in Table 7.1 correspond to no assumptions about exogeneity,
and they yield vacuous bounds for PN when data come from either an experimental
or an observational study. In contrast, informative bounds (7.25) or point
estimates (7.49) are obtained when data from experimental and observational
studies are combined. Concrete use of such a combination will be illustrated in
Section 7.5.


7.5  Example  1:  Legal  Responsibility 

A  lawsuit  is  filed  against  the  manufacturer  of  drug  x,  charging  that  the  drug  is 
likely  to  have  caused  the  death  of  Mr.  A,  who  took  the  drug  to  relieve  symptom 
S  associated  with  disease  D. 

The manufacturer claims that experimental data on patients with symptom S
show conclusively that drug x may cause only a minor increase in death rates. The
plaintiff argues, however, that the experimental study is of little relevance to this
case, because it represents the effect of the drug on all patients, not on patients
like Mr. A who actually died while using drug x. Moreover, argues the plaintiff,
Mr. A is unique in that he used the drug of his own volition, unlike subjects
in the experimental study who took the drug to comply with experimental protocols.
To support this argument, the plaintiff furnishes nonexperimental data
indicating that most patients who chose drug x would have been alive if it were
not for the drug. The manufacturer counter-argues by stating that: (1) counterfactual
speculations regarding whether patients would or would not have died
are purely metaphysical and should be avoided, and (2) nonexperimental data
should be dismissed a priori, on the ground that such data may be highly biased;
for example, incurable terminal patients might be more inclined to use drug x if
it provides them greater symptomatic relief. The court must now decide, based
on both the experimental and non-experimental studies, what the probability is
that drug x was in fact the cause of Mr. A’s death.

The (hypothetical) data associated with the two studies are shown in Table 7.2.

Table 7.2: Frequency data (hypothetical) obtained in experimental and nonexperimental
studies, comparing deaths (in thousands) among drug users, x, and
non-users, x'.

                      Experimental        Nonexperimental
                      x        x'         x        x'
    Deaths (y)        16       14         2        28
    Survivals (y')    984      986        998      972

The experimental data provide the estimates


$$P(y_x) = 16/1000 = 0.016$$

$$P(y_{x'}) = 14/1000 = 0.014$$

$$P(y'_{x'}) = 1 - P(y_{x'}) = 0.986$$

The non-experimental data provide the estimates

$$P(y) = 30/2000 = 0.015$$

$$P(x,y) = 2/2000 = 0.001$$

$$P(x',y') = 972/2000 = 0.486$$
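
The estimates above are simple relative frequencies. As a minimal sketch (the dictionary layout and variable names are hypothetical), they can be recovered from the counts of Table 7.2 as follows:

```python
# Counts from Table 7.2 (in thousands), keyed by treatment arm.
experimental = {"x": {"deaths": 16, "survivals": 984},
                "x'": {"deaths": 14, "survivals": 986}}
nonexperimental = {"x": {"deaths": 2, "survivals": 998},
                   "x'": {"deaths": 28, "survivals": 972}}


def arm_total(arm):
    return arm["deaths"] + arm["survivals"]


# Experimental study: each arm is randomized, so frequencies within an
# arm estimate the causal effect probabilities P(y_x) and P(y_x').
p_y_do_x = experimental["x"]["deaths"] / arm_total(experimental["x"])
p_y_do_xp = experimental["x'"]["deaths"] / arm_total(experimental["x'"])
p_yp_do_xp = 1.0 - p_y_do_xp

# Nonexperimental study: frequencies over the whole sample estimate the
# joint distribution of treatment choice and outcome.
n_obs = sum(arm_total(arm) for arm in nonexperimental.values())
p_y = sum(arm["deaths"] for arm in nonexperimental.values()) / n_obs
p_xy = nonexperimental["x"]["deaths"] / n_obs
p_xpyp = nonexperimental["x'"]["survivals"] / n_obs

print(p_y_do_x, p_y_do_xp, p_yp_do_xp, p_y, p_xy, p_xpyp)
```

Note the different denominators: experimental frequencies are taken within each arm, whereas nonexperimental joint probabilities are taken over all 2000 subjects.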


Since  both  the  experimental  and  nonexperimental  data  are  available,  we  can 
obtain  bounds  on  all  three  probabilities  of  causation  through  Eqs.  (7.24)-(7.26) 
without  making  any  assumptions  about  the  underlying  mechanisms.  The  data  in 
Table  7.2  imply  the  following  numerical  results: 


$$0.002 \le \mathrm{PNS} \le 0.016 \tag{7.58}$$

$$1.0 \le \mathrm{PN} \le 1.0 \tag{7.59}$$

$$0.002 \le \mathrm{PS} \le 0.031 \tag{7.60}$$

These figures show that although surviving patients who did not take drug x have
less than a 3.1% chance of dying had they taken the drug, there is 100% assurance
(barring sampling errors) that those who took the drug and died would have
survived had they not taken the drug. Thus the plaintiff was correct; drug x was
in fact responsible for the death of Mr. A.
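
The numerical bounds above can be reproduced with a short computation. Since Eqs. (7.24)-(7.26) are not restated in this section, the sketch below assumes they take the following max/min form over experimental and nonexperimental quantities; this assumed form does reproduce the figures quoted in (7.58)-(7.60) after rounding. The function and argument names are hypothetical.

```python
def causation_bounds(p_yx, p_yxp, p_xy, p_xyp, p_xpy, p_xpyp):
    """Bounds on PNS, PN and PS from combined data (assumed form).

    p_yx, p_yxp               : experimental P(y_x), P(y_x')
    p_xy, p_xyp, p_xpy, p_xpyp: observational P(x,y), P(x,y'),
                                P(x',y), P(x',y')
    """
    p_y = p_xy + p_xpy
    pns = (max(0.0, p_yx - p_yxp, p_y - p_yxp, p_yx - p_y),
           min(p_yx, 1.0 - p_yxp, p_xy + p_xpyp,
               p_yx - p_yxp + p_xyp + p_xpy))
    pn = (max(0.0, (p_y - p_yxp) / p_xy),
          min(1.0, ((1.0 - p_yxp) - p_xpyp) / p_xy))
    ps = (max(0.0, (p_yx - p_y) / p_xpyp),
          min(1.0, (p_yx - p_xy) / p_xpyp))
    return {"PNS": pns, "PN": pn, "PS": ps}


# Estimates derived from Table 7.2.
bounds = causation_bounds(p_yx=0.016, p_yxp=0.014,
                          p_xy=0.001, p_xyp=0.499,
                          p_xpy=0.014, p_xpyp=0.486)
for name, (lower, upper) in bounds.items():
    print(f"{lower:.3f} <= {name} <= {upper:.3f}")
```

Floating-point rounding can push a lower bound marginally past the matching upper bound (as may happen for PN here, where the two coincide), so comparisons should be made after rounding.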

If  we  assume  that  drug  x  can  only  cause,  but  never  prevent,  death,  Theorem  25 
is  applicable  and  Eqs.  (7.40)-(7.42)  yield 


$$\mathrm{PNS} = 0.002 \tag{7.61}$$

$$\mathrm{PN} = 1.0 \tag{7.62}$$

$$\mathrm{PS} = 0.002 \tag{7.63}$$
Thus, we conclude that drug x was responsible for the death of Mr. A, with or
without the no-prevention assumption.
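
The point estimates under monotonicity can be sketched in the same way. Eqs. (7.40)-(7.42) are not restated in this section; the code below assumes the forms PNS = P(y_x) - P(y_x'), PN = [P(y) - P(y_x')]/P(x,y), and PS = [P(y_x) - P(y)]/P(x',y'), of which the PN expression agrees with the left-hand side of (7.49). The function name is hypothetical.

```python
def point_estimates_monotone(p_yx, p_yxp, p_y, p_xy, p_xpyp):
    """Point estimates of (PNS, PN, PS) under monotonicity (assumed form).

    p_yx, p_yxp are experimental; p_y, p_xy, p_xpyp are observational.
    """
    pns = p_yx - p_yxp
    pn = (p_y - p_yxp) / p_xy
    ps = (p_yx - p_y) / p_xpyp
    return pns, pn, ps


# Estimates derived from Table 7.2.
pns, pn, ps = point_estimates_monotone(p_yx=0.016, p_yxp=0.014,
                                       p_y=0.015, p_xy=0.001,
                                       p_xpyp=0.486)
print(f"PNS={pns:.3f}  PN={pn:.3f}  PS={ps:.3f}")
```

Under these assumed forms, each point estimate coincides with the lower end of the corresponding assumption-free bound for this data set.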

Note  that  a  straightforward  use  of  the  experimental  excess-risk-ratio  would 
yield  a  much  lower  (and  incorrect)  result: 


$$\frac{P(y_x) - P(y_{x'})}{P(y_x)} = \frac{0.016 - 0.014}{0.016} = 0.125 \tag{7.64}$$


Evidently, what the experimental study does not reveal is that, given a choice,
terminal patients stay away from drug x. Indeed, if there were any terminal
patients who would choose x (given the choice), then the control group (x') would
have included some such patients (due to randomization) and so the proportion
of deaths among the control group $P(y_{x'})$ would have been higher than $P(x', y)$,
the population proportion of terminal patients avoiding x. However, the equality
$P(y_{x'}) = P(y, x')$ tells us that no such patients were present in the control group,
hence (by randomization) no such patients exist in the population at large and
therefore none of the patients who freely chose drug x was a terminal case; all
were susceptible to x.

The numbers in Table 7.2 were obviously contrived to show the usefulness
of the bounds in Eqs. (7.24)-(7.26). Nevertheless, it is instructive to note that
a combination of experimental and non-experimental studies may unravel what
experimental studies alone will not reveal. In addition, such a combination may
provide a test for the assumption of no-prevention, as outlined in Section 7.4.4.1.
For example, if the frequencies in Table 7.2 were slightly different, they could
easily violate the inequalities of Eq. (7.39). Such a violation may be due either to
nonmonotonicity or to incompatibility of the experimental and nonexperimental
groups.
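
A sketch of such a test follows. Monotonicity, together with consistency, implies P(y_x) >= P(y) >= P(y_x'); we assume here that this is the content of Eq. (7.39), which is not restated in this section. The function name and the perturbed count (32 instead of 28 deaths among non-users) are hypothetical.

```python
def compatible_with_monotonicity(p_yx, p_y, p_yxp):
    """Necessary condition for monotonicity: P(y_x) >= P(y) >= P(y_x').

    A False result means either that monotonicity fails or that the
    experimental and nonexperimental groups are incompatible.
    """
    return p_yx >= p_y >= p_yxp


# Table 7.2 as given: 2 + 28 = 30 deaths among 2000 observational subjects.
original = compatible_with_monotonicity(0.016, 30 / 2000, 0.014)

# Hypothetical perturbation: 32 deaths among non-users instead of 28.
perturbed = compatible_with_monotonicity(0.016, 34 / 2000, 0.014)

print(original, perturbed)
```

The margins in Table 7.2 are narrow (0.016, 0.015, 0.014), which is why a change of a few deaths per thousand is enough to produce a violation.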

This last point may warrant a word of explanation, lest the reader wonder
why two data sets, taken from two separate groups under different experimental
conditions, should constrain one another. The explanation is that certain quantities
in the two subpopulations are expected to remain invariant to all these
differences, provided that the two subpopulations were sampled properly from
the same general population. In fact, every quantity of the form P(Q), where
Q is computable from a functional causal model M, enjoys this invariance property,
because the two subpopulations are assumed to be governed by the same
functional causal model. Thus, the question of whether two data sets, obtained
under different experimental conditions, should constrain one another reduces
to a purely mathematical question of whether the quantities that represent the
two experimental conditions, P(Q) and P(Q'), necessarily constrain one another
in the functional causal model considered. In our case, the quantities in
question are simply the causal effect probabilities, $P(y_{x'})$ and $P(y_x)$. Although
these probabilities were not measured in the nonexperimental group, they must
nevertheless be the same as those measured in the experimental group. The
invariance of these quantities is the basic axiom of controlled experimentation,
without which no inference would be possible from experimental studies to the general
behavior of the population. This invariance, together with monotonicity,
implies the inequalities of (7.39).

7.6  Example  2:  Personal  Decision  Making 

Consider the case of Mr. B, who is one of the surviving patients in the observational
study of Table 7.2. Mr. B wonders how safe it would be for him to take
drug x, given that he has refrained thus far from taking the drug and that he
managed to survive the disease. His argument for switching to the drug rests
on the observation that only 2 out of 1000 drug users died in the observational
study, which he considers a rather small risk to take, given the effectiveness of
the drug as a pain killer.

Conventional wisdom instructs us to warn Mr. B against consulting a nonexperimental
study in matters of decisions, since such studies are marred by
uncontrolled factors, which tend to bias effect estimates. Specifically, the death
rate of 0.002 among drug users may be indicative of low tolerance to discomfort,
or of membership in a medically-informed socio-economic group. Such factors
do not apply to Mr. B, who did not use the drug in the past (be it by choice,
instinct or ignorance), and who is now considering switching to the drug by rational
deliberation. Conventional wisdom urges us to refer Mr. B to the randomized
experimental study of Table 7.2, from which the death rate under controlled
administration of the drug was evaluated to be $P(y_x) = 0.016$, eight times higher
than 0.002.

What  would  his  risk  of  death  be,  if  Mr.  B  decides  to  start  taking  the  drug? 
0.2  percent  or  1.6  percent? 

The answer is that neither number is correct. Mr. B cannot be treated as a
random patient in either study, because his history of not using the drug and his
survival thus far put him in a unique category of patients, for which the effect of
the drug was not studied.12 These two attributes provide extra evidence about
Mr. B’s sensitivity to the drug. This became clear already in Example 1, where
we discovered definite relationships among these attributes: for some obscure
reasons, terminal patients chose not to use the drug.

12 The appropriate experimental design for measuring the risk of interest is to conduct a
randomized clinical trial on patients in the category of Mr. B, that is, to subject a random
sample of non-users to a period of drug treatment and measure their rate of survival.
To properly account for this additional evidence, the risk should be measured
through the counterfactual expression $\mathrm{PS} = P(y_x | x', y')$: the probability that a
patient who survived with no drug would have died had he/she taken the drug.
The appropriate bound for this probability is given in Eq. (7.60):

$$0.002 \le \mathrm{PS} \le 0.031$$

Thus, Mr. B’s risk of death (upon switching to drug usage) can be as high as
3.1 percent: more than 15 times his intuitive estimate of 0.2 percent, and almost
twice the naive estimate obtained from the experimental study.

However, if the drug can safely be assumed to have no death-preventing effects,
then monotonicity applies, and the appropriate value is the point estimate of Eq. (7.63),
PS = 0.002, which coincides with Mr. B’s intuition.
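
The comparisons quoted in this example are simple ratios of the figures already given. As a small worked check (variable names are hypothetical; the inputs are the rounded values from Eq. (7.60) and Table 7.2):

```python
ps_upper = 0.031          # upper bound on PS, Eq. (7.60)
intuitive_risk = 0.002    # death rate among drug users, observational study
experimental_risk = 0.016  # P(y_x) from the randomized study

ratio_to_intuition = ps_upper / intuitive_risk
ratio_to_experiment = ps_upper / experimental_risk

print(f"{ratio_to_intuition:.1f} times the intuitive estimate")
print(f"{ratio_to_experiment:.2f} times the experimental estimate")
```

Both ratios refer to the worst case allowed by the bound; under monotonicity the risk collapses to the lower end, 0.002.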


7.7  Conclusion 

This chapter shows how useful information about probabilities of causation can
be obtained from experimental and observational studies, with weak or no assumptions
about the data-generating process. We have shown that, in general,
bounds for the probabilities of causation can be obtained from combined experimental
and nonexperimental data. These bounds were proven to be sharp and,
therefore, they represent the ultimate information that can be extracted from
statistical methods. We have further illustrated the applicability of these results
to problems in epidemiology and legal reasoning, and we have clarified the two
basic assumptions (exogeneity and monotonicity) that must be ascertained before
statistical measures such as excess-risk-ratio could represent attributional
quantities such as probability of causation.

It is appropriate at this point to discuss the relation between the assumptions
in the example of Section 7.5 (where we have population probabilities and available
experiments) and the general framework with which the chapter begins
(where we have exogenous variables that determine everything and the probabilities
enter as an add-on feature). Traditional statisticians might judge the
deterministic model incompatible with the stochastic nature of the data, and
would be tempted to start the analysis at Section 7.3 (see [RG89] and [FS99]),
without the counterfactual model expounded in Section 7.2. However, traditional
statistical analysis cannot commence without explicating the quantity we wish
to estimate (that is, PN), for which we have no empirical data and for which
we have no statistical definition. Instead, our target quantity is defined verbally
by law makers as a mixture of probabilistic and deterministic components: “it
is more probable than not that the plaintiff’s injury would not have occurred but
for the defendant’s action”. The “more probable than not” criterion is probabilistic
while the “but for” criterion is deterministic, implying counterfactual necessity.

The structural approach expounded in this chapter gives a clear semantics
to this mixture, typical of counterfactual expressions, and relates it in a natural
way to empirical data. The stochastic nature of the data is viewed as emerging
from our ignorance of the detailed experimental conditions that prevailed in the
study. The exogenous variables in U represent these missing details, and include
the physiology and previous history of each person, his/her mental and spiritual
attitude, as well as the time and manner in which the exposure occurred. In
short, U summarizes all the factors which “determine” in the classical physical
sense the outcome of the study. P(u) summarizes our ignorance of those factors.

The main application of our analysis to artificial intelligence lies in the automatic
generation of causal explanations, where the distinction between necessary
and sufficient causes has important ramifications. As can be seen from the definitions
and examples discussed in this chapter, necessary causation is a concept
tailored to a specific event under consideration (singular causation), whereas sufficient
causation is based on the general tendency of certain event types to produce
other event types. Adequate explanations should respect both aspects. If we
base explanations solely on generic tendencies (i.e., sufficient causation) then we
lose important scenario-specific information. For instance, aiming a gun at and
shooting a person from 1,000 meters away will not qualify as an explanation for
that person’s death, owing to the very low tendency of shots fired from such long
distances to hit their marks. This stands contrary to common sense, for when the
shot does hit its mark on that singular day, regardless of the reason, the shooter
is an obvious culprit for the consequence. If, on the other hand, we base explanations
solely on singular-event considerations (i.e., necessary causation), then
ambient factors that are normally present in the world would awkwardly qualify
as explanations. For example, the presence of oxygen in the room would qualify
as an explanation for the fire that broke out, simply because the fire would not
have occurred were it not for the oxygen. That we judge the match struck, not
the oxygen, to be the more adequate explanation of the fire indicates that we go
beyond necessity considerations.

Recasting the question in the language of PN and PS, we note that, since
both explanations are necessary for the fire, each will command a PN of unity.
(In fact, the PN is actually higher for the oxygen if we allow for alternative ways
of igniting a spark.) Thus, it must be the sufficiency component that endows
the match with greater explanatory power than the oxygen. If the probabilities
associated with striking a match and the presence of oxygen are denoted $p_m$
and $p_o$, respectively, then the PS measures associated with these explanations
evaluate to PS(match) = $p_o$ and PS(oxygen) = $p_m$, clearly favoring the match
when $p_o \gg p_m$. Thus, a robot instructed to explain why a fire broke out has no
choice but to consider both PN and PS in its deliberations.
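
The values PS(match) = p_o and PS(oxygen) = p_m can be verified by enumeration in a toy structural model. The sketch below makes assumptions that go beyond the text: the fire is modeled as the conjunction of match and oxygen, the two factors are taken to be independent exogenous variables, and the numerical values p_m = 0.01 and p_o = 0.99 are hypothetical. Alternative ignition mechanisms are deliberately left out.

```python
from itertools import product


def fire(match, oxygen):
    # Assumed structural equation: fire requires both factors.
    return match and oxygen


def pn_ps(cause, p_m, p_o):
    """PN and PS of `cause` ("match" or "oxygen") for the fire.

    Each state u = (match, oxygen) is weighted by its prior, the
    evidence selects the relevant states, and the outcome is recomputed
    with the cause forced off (for PN) or on (for PS).
    Requires 0 < p_m < 1 and 0 < p_o < 1.
    """
    num_pn = den_pn = num_ps = den_ps = 0.0
    for m, o in product((True, False), repeat=2):
        p_u = (p_m if m else 1.0 - p_m) * (p_o if o else 1.0 - p_o)
        x = m if cause == "match" else o
        y = fire(m, o)
        y_on = fire(True, o) if cause == "match" else fire(m, True)
        y_off = fire(False, o) if cause == "match" else fire(m, False)
        if x and y:                  # evidence for PN: cause and fire occurred
            den_pn += p_u
            if not y_off:
                num_pn += p_u
        if not x and not y:          # evidence for PS: neither occurred
            den_ps += p_u
            if y_on:
                num_ps += p_u
    return num_pn / den_pn, num_ps / den_ps


pn_match, ps_match = pn_ps("match", p_m=0.01, p_o=0.99)
pn_oxygen, ps_oxygen = pn_ps("oxygen", p_m=0.01, p_o=0.99)
print(f"match:  PN={pn_match:.2f}  PS={ps_match:.2f}")
print(f"oxygen: PN={pn_oxygen:.2f}  PS={ps_oxygen:.2f}")
```

In this conjunctive model both factors have PN equal to one, so only PS separates them, in line with the argument above.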

Clearly, some balance must be made between the necessary and the sufficient
components of causal explanation, and the present chapter illuminates this balance
by formally explicating the basic relationships between the two components.
In Pearl (2000, chapter 10) it is further shown that PN and PS are too crude
for capturing probabilities of causation in multi-stage scenarios, and that the
structure of the intermediate process leading from cause to effect must enter the
definitions of causation and explanation. Such considerations will be the subject
of future investigation (see [HP00]).

Another important application of probabilities of causation is found in decision
making problems, such as those encountered in medicine, system maintenance,
and planning under uncertainty. As was pointed out in [Pea00, pp. 217-219],
the counterfactual “y would have been true if x were true” can often be translated
into a conditional action claim “given that currently x and y are false, y will be
true if we do x.” The evaluation of such conditional predictions, and the probabilities
of such predictions, are commonplace in decision making situations, where
actions are brought into focus by certain eventualities that demand remedial correction.
In troubleshooting, for example, we observe undesirable effects Y = y
that are potentially caused by other conditions X = x and we wish to predict
whether an action that brings about a change in X would remedy the situation.
The information provided by the evidence y and x is extremely valuable, and
it must be processed (using the updated distribution $P(u|x,y)$, as in Eq. (7.9))
before we can predict the effect of any action.13 Thus, the expressions developed
in this chapter constitute bounds on the effectiveness of pending policies, when
full knowledge of the current state of affairs (u) is not available, yet the current
states of the decision variable (X) and the outcome variable (Y) are measured.

For these bounds to be valid in policy making, the context u must be time-invariant,
that is, the probability P(u) should represent epistemic uncertainty
about a static, albeit unknown context U = u. The constancy of u is well justified
in the control and diagnosis of physical systems, where u represents fixed,
but unknown physical characteristics of devices or subsystems. The constancy
approximation is also justified in the health sciences where patients’ genetic attributes
and physical characteristics can be assumed relatively constant between
observation and treatment.

The constancy assumption is less justified in economic systems, where agents
are bombarded by a rapidly fluctuating stream of external forces (“shocks” in
econometric terminology) as well as by inter-agent communication messages.

13 Such processing has indeed been applied to the evaluation of economic policies [BP95]
and to repair-test strategies in troubleshooting [BH96].
These exogenous factors may vary substantially during the policy making interval
and they require, therefore, time-dependent analysis. The canonical violation
of the constancy assumption occurs, of course, in quantum mechanical systems,
where the indeterminism associated with U is “intrinsic”, and the existence of a
deterministic relationship between U and V is no longer a good approximation.
A method of incorporating such intrinsic indeterminism into counterfactual analysis
is outlined in [Pea00, p. 220], and leads to Eq. (7.9), where $P(Y_{x'}(u) = y')$
represents the intrinsic uncertainty in Y associated with the macroscopic state
U = u, under the action do(X = x') (see footnote 6).